
The replicated state machine method of fault tolerance from the 1980s

The first time I saw this method was when I went to work for Parallel Computer Systems, later called Auragen, in the famous tech startup center of Englewood Cliffs, New Jersey. I commuted there from the East Village. (True story: I applied for the job after finding an advert in a discarded copy of the NY Times on the floor of a Brooklyn apartment while visiting friends. I sent, via US mail, a resume typed on a manual typewriter and forgot to send the second page. I’m tempted to say “composed by the light of a tallow candle”, but that would be over the top.)

The company built a parallel computer based on Motorola 68000s with a replicated message bus. The bus guaranteed that delivery of a message to three destinations would either succeed at all three or fail at all three. This property is called “reliable broadcast”. All interprocess communication was by message transfer (a fashionable idea at the time). Each process had a backup. Whenever a primary process sent a message, the message was delivered to the destination, to the destination’s backup, and to the sender’s own backup. If the primary failed, the backup could be run. The backup would have a queue of messages received by the primary and a count of messages sent by the primary. When the recovering backup tried to transmit a message, if the count was greater than zero, the count would be decremented and the message discarded because it had already been transmitted by the old primary. When the recovering backup did a receive operation, if there was a message on the input queue, it would get that message. In this way, the recovering backup would repeat the operations of the primary until it caught up. As an optimization, the primary could be periodically checkpointed and the queues of duplicated messages could be discarded.
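The replay rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class `BackupProcess`, its method names, and the toy "double the input" program are all hypothetical and are not taken from the Auragen system or the patent.

```python
from collections import deque


class BackupProcess:
    """Hypothetical model of a backup that replays its failed primary's history."""

    def __init__(self):
        self.input_queue = deque()  # messages the primary received since the last checkpoint
        self.sent_count = 0         # messages the primary sent since the last checkpoint
        self.transmitted = []       # messages this process really puts on the bus

    # While the primary is alive, the bus keeps the backup's bookkeeping current.
    def record_primary_receive(self, message):
        self.input_queue.append(message)

    def record_primary_send(self):
        self.sent_count += 1

    # After failover, the backup runs the same program against these calls.
    def send(self, message):
        if self.sent_count > 0:
            # The old primary already transmitted this message: discard it.
            self.sent_count -= 1
            return False
        self.transmitted.append(message)
        return True

    def receive(self):
        # Replay saved input first; None means the backup has caught up.
        if self.input_queue:
            return self.input_queue.popleft()
        return None

    def checkpoint(self):
        # Once the primary's state is checkpointed, the saved history can go.
        self.input_queue.clear()
        self.sent_count = 0


def run_program(proc, steps):
    """Toy deterministic program: receive a number, send its double."""
    for _ in range(steps):
        value = proc.receive()
        if value is None:
            break
        proc.send(value * 2)


backup = BackupProcess()
# The primary received 1, 2, 3 but crashed after replying to only 1 and 2.
for m in (1, 2, 3):
    backup.record_primary_receive(m)
backup.record_primary_send()
backup.record_primary_send()

run_program(backup, steps=3)
print(backup.transmitted)  # only the reply the primary never sent: [6]
```

The first two sends are swallowed because the primary had already made them, so the outside world sees each reply exactly once.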

The operating system was an implementation of UNIX. In practice, it was discovered that making each UNIX system call into a message exchange, which was an idea advocated in the OS research community at the time, caused serious performance problems. The replicated state machine approach depended on this design to keep execution deterministic. Suppose the primary requested, for example, the time and then made a decision based on the time. A recovering backup would need exactly the same time to guarantee that it produced the same results as the primary. So every interaction between application and OS needed to be recorded in a message exchange. But a message exchange is nowhere near as fast as a system call (unless the OS developers are horrible).
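To make the time example concrete, here is a small Python sketch of recording an OS reply on the primary and replaying it on the backup. The names `RecordingOS`, `ReplayingOS`, and `application` are hypothetical, and a plain list stands in for the message channel to the backup.

```python
import time
from collections import deque


class RecordingOS:
    """Primary side: answers each request and logs the reply for the backup."""

    def __init__(self, log):
        self.log = log

    def get_time(self):
        reply = time.time()
        self.log.append(("get_time", reply))
        return reply


class ReplayingOS:
    """Backup side: answers from the log instead of asking the real clock."""

    def __init__(self, log):
        self.log = deque(log)

    def get_time(self):
        call, reply = self.log.popleft()
        if call != "get_time":
            raise RuntimeError("replay diverged from the recorded history")
        return reply


def application(os_iface):
    # A decision that depends on the clock: nondeterministic unless replayed.
    now = os_iface.get_time()
    return "even" if int(now * 1_000_000) % 2 == 0 else "odd"


log = []
primary_result = application(RecordingOS(log))
backup_result = application(ReplayingOS(log))
print(primary_result == backup_result)
```

Because the backup sees the recorded reply rather than a fresh clock reading, it takes the same branch the primary took.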

The performance issue was mitigated by some clever engineering, but it was a problem discovered in parallel by a number of development teams working on the distributed OS designs and microkernels that were in vogue at the time. Execution of “ls -l” was particularly interesting.

Anyways, here’s the description from the patent.

To accomplish this object, the invention contemplates that instead of keeping the backup or secondary task exactly up to date, the backup is kept nearly up to date but is provided with all information necessary to bring itself up to the state of the primary task should there be a failure of the primary task. The inventive concept is based on the notion that if two tasks start out in identical states and are given identical input information, they will perform identically.

In particular, all inputs to a process running on a system according to the invention are provided via messages. Therefore, all messages sent to the primary task must be made available to the secondary or backup task so that upon failure of the primary task the secondary task catches up by recomputing based on the messages. In essence, then, this is accomplished by allowing every backup task to “listen in on” its primary’s message.

United States Patent 4,590,554, Glazer et al., May 20, 1986

Inventors: Glazer; Sam D. (New York, NY), Baumbach; James (Brooklyn, NY), Borg; Anita (New York, NY), Wittels; Emanuel (Englewood Cliffs, NJ)
Assignee: Parallel Computers Systems, Inc. (Fort Lee, NJ)
Family ID: 23762790
Appl. No.: 06/443,937
Filed: November 23, 1982

See also: A message system supporting fault tolerance.

and a very similar later patent.

Cutting and pasting about Kodak’s demise

This graph from Peter Diamandis about how Kodak entered the sinkhole is kind of amazing. Diamandis explains Kodak’s failure to swerve in what is, I think, the orthodox Silicon Valley analysis:

[in 1996] Kodak had a $28 billion market cap and 140,000 employees.

In 1976, 20 years earlier, Kodak had invented the digital camera. They owned the IP and had the first mover advantage. This is a company that should have owned it all.

Instead, in 2012, Kodak filed for bankruptcy, put out of business by the very technology they had invented.

What happened? Kodak was married to the “paper and chemicals” (film development) business… their most profitable division, while the R&D on digital cameras was a cost center. They saw the digital world coming on, but were convinced that digital cameras wouldn’t have traction outside of the professional market. They certainly had the expertise to design and build consumer digital cameras — Kodak actually built the Apple QuickTake (see photo), generally considered the world’s first consumer digital camera. Amazingly, Kodak decided they didn’t even want to put their name on the camera.

There is more of the same (2012 and before) in the MIT Technology Review. This is a totally convincing story (it had me convinced), but it leaves out three things:

  1. the boring old chemicals division of Eastman Kodak, which was spun off in 1993 (three years earlier), is still around, profitable ($10B/year revenue), and dwarfs what’s left of the original company,
  2. Fujifilm, Kodak’s also-ran in the chemical film space, managed to leap over the sinkhole and prosper, but not by relying on digital cameras.
  3. Kodak did briefly become a market leader in digital cameras but ran into a more fundamental problem.

Back in 1993, the business press and analysts thought Kodak was doing the right thing by divesting its chemical business.

NEW YORK — Eastman Kodak Co., struggling against poor profit and high debt, Tuesday took a big step in its corporate restructuring, announcing that it will divest Eastman Chemical Co. and in one fell swoop wipe out $2 billion of debt.

Such a spinoff would not have occurred just a few years ago, analysts said, and the move signals that Chief Executive Kay R. Whitmore is responding to new, tougher markets and stockholder pressure to improve financial results quickly.

“They are now recognizing that they are not a growth company, that they must go through this downsizing,” analyst Eugene Glazer of Dean Witter Reynolds said in an interview on CNBC.

Kodak’s shares, up sharply Monday in anticipation of the announcement, ended down $1.375 to $52.375 on the New York Stock Exchange.

[…] “We determined that there was little strategic reason related to our core imaging and health business for Kodak to continue to own Eastman,” Whitmore said at a news conference.

Kodak, best known for photography products but also a major pharmaceutical and chemicals group, has endured slow growth for years. Its photography business has been hit by changing demographics, foreign rivals and new technologies such as camcorders. Whitmore said Kodak sales, especially in photography and imaging, were weak.

[…] Costs will be reduced elsewhere in the company, Whitmore said. Other executives also have said Kodak will cut spending on research and development of new products.

In retrospect, dumping the cash-generating parts of the business and cutting R&D was not the best plan, even if Wall St. analysts loved the idea. But it’s easy to be a genius after the fact, as Willy Shih points out:

Responding to recommendations from management experts, from the mid-1990s to 2003 the company set up a separate division (which I ran) charged with tackling the digital opportunity. Not constrained by any legacy assets or practices, the new division was able to build a leading market share position in digital cameras — a position that was essentially decimated soon thereafter when smartphones with built-in cameras overtook the market.

Yes, those camera phones – which not too many people saw coming in the 1990s. Not only that, but Kodak’s path in digital imaging was not obvious.

The transition from analog to digital imaging brought several challenges. First, digital imaging was based on a general-purpose semiconductor technology platform that had nothing to do with film manufacturing — it had its own scale and learning curves. The broad applicability of the technology platform meant that it could be scaled up in numerous high-volume markets (such as microprocessors, logic circuits, and communications chips) apart from digital imaging. Suppliers selling components offered the technology to anyone who would pay, and there were few entry barriers. What’s more, digital technology is modular. A good engineer could buy all the building blocks and put together a camera. These building blocks abstracted almost all the technology required, so you no longer needed a lot of experience and specialized skills.

Semiconductor technology was well outside of Kodak’s core know-how and organizational capabilities. Even though the company invested lots of money in the basic research and manufacturing of solid-state semiconductor image sensors and developed some notable inventions (including the color filter array that is used on virtually every color image sensor), it had little hope of being a competitive volume supplier of image sensor components, and it was difficult for Kodak to offer something distinctive.

And Shih, perhaps unintentionally, reinforces Diamandis’s point that the top company managers failed to face up to the problem.

For many managers of legacy businesses, the survival instinct kicked in. Some who had worked at Kodak for decades felt they were entitled to be reassigned to the new businesses, or wished to control sales channels for digital products. But that just fueled internal strife. Kodak ended up merging the consumer digital, professional, and legacy consumer film divisions in 2003. Kodak then tried to make inroads in the inkjet printing business, spending heavily to compete with fortified incumbents such as HP, Canon, and Epson. But the effort failed, and Kodak exited the printer business after it filed for Chapter 11 bankruptcy reorganization in 2012.

Management chaos and “spending heavily to compete with fortified incumbents”.


With the benefit of hindsight, it’s interesting to ask how Kodak might have been able to achieve a different outcome. One argument is that the company could have tried to compete on capabilities rather than on the markets it was in. This would have meant directing its skills in complex organic chemistry and high-speed coating toward other products involving complex materials — a path followed successfully by Fuji. However, this would have meant walking away from a great consumer franchise. That’s not the logic that managers learn at business schools, and it would have been a hard pill for Kodak leaders to swallow.

it would have been a hard pill for Kodak leaders to swallow.

But wasn’t that their job? So to conclude this exercise in cut and paste, what about Fuji? The Economist had an interesting take:

 the digital imaging sector accounts for only about one-fifth of Fujifilm’s revenue, down from more than half a decade ago.

How Fujifilm succeeded serves as a warning to American firms about the danger of trying to take the easy way out: competing through one’s marketing rather than taking the harder route of developing new products and new businesses. […]

Like Kodak, Fujifilm realised in the 1980s that photography would be going digital. Like Kodak, it continued to milk profits from film sales, invested in digital technologies, and tried to diversify into new areas. Like Kodak, the folks in the wildly profitable film division were in control and late to admit that the film business was a lost cause. As late as 2000 Fujifilm counted on a gentle 15 or 20-year decline of film—not the sudden free-fall that took place. Within a decade, film went from 60% of Fujifilm’s profits to basically nothing.

If the market forecast, strategy and internal politics were the same, why the divergent outcomes? The big difference was execution.

Fujifilm realised it needed to develop in-house expertise in the new businesses. In contrast, Kodak seemed to believe that its core strength lay in brand and marketing, and that it could simply partner or buy its way into new industries, such as drugs or chemicals. The problem with this approach was that without in-house expertise, Kodak lacked some key skills: the ability to vet acquisition candidates well, to integrate the companies it had purchased and to negotiate profitable partnerships. “Kodak was so confident about their marketing capability and their brand, that they tried to take the easy way out,” says Mr Komori.

Fujifilm realised it needed to develop in-house expertise in the new businesses.



The Auragen file system.

This article on the interesting Wave Transactional File System inspired me to look up an earlier file system that also used copy on write semantics.


Anita Borg, Wolfgang Blau, Wolfgang Graetsch, Ferdinand Herrmann, and Wolfgang Oberle. 1989. Fault tolerance under UNIX. ACM Trans. Comput. Syst. 7, 1 (January 1989), 1-24.


4.3 Availability of the File System
Since a recovering file server reconstructs its buffers by reading blocks from the file system, the file system in the state as of the last sync must be available. The existence of that version of the file system is also necessary during recovery as the file server redoes requests. For example, if a file has been deleted since sync and a read request is reissued, the disk driver, and thus the recovering file server, will behave differently than the primary. Unfortunately, the contents of the disk can change between syncs, at least during the Fsync that constitutes the first phase of the sync operation.

The solution is to use a copy-on-write strategy between syncs, rather than overwriting existing blocks. Logically this corresponds to keeping two versions of a file system. An early version of the file system organization described here is discussed in Arnow [11].

There are two root nodes on disk. At any given time one of them is valid for recovery. We refer to the other as the alternate root. Associated with each root is state information (the state tables described above), the most recent being that associated with the currently valid root. Changes to the file system are done relative to a copy of the valid root kept in memory in the primary file server’s address space, and in a nondestructive manner, as seen in Figure 2(a-d). Freed blocks, which contain the old data, are added to a semi-free list, and cannot be reallocated until after the next sync. Therefore, the unmodified file system still exists rooted in the valid on-disk root node.

If a crash occurs at any time between syncs, the recovering file server is able to determine which root to use because of information sent on the primary’s last sync. It reads in the correct state information and reconstructs its buffers accordingly. Disk blocks that were used by the primary since the last sync appear to it as free blocks.

The difficult case is when a crash occurs during a sync. To see that the solution works in this case, consider the sequence of actions that take place during a sync. First, all dirty blocks except the root are written to disk, and old blocks are added to the semi-free list. Second, the state information is collected and written to the alternate state area. Third, the in-memory root is written to the alternate on-disk root block. Finally, the sync message is constructed and sent to the backup. It contains the information necessary to update message queues as well as specifying which on-disk state information and root block to use on recovery.

Once the sync message has been sent, the semi-free list is added to the free list and the primary continues. Just before the sync message is sent, there are two copies of every modified data and indirect block. At any time before the sync message is sent, the old consistent state is available. Any time after it is sent, the new state and file system will be used and message queues consistently updated. An additional benefit of this organization is that the file system as a whole is considerably more robust than a standard UNIX-style file system. Even if the entire system is shut down in an uncontrolled way as the result of multiple faults or operator error, there will always be an entire consistent file system on disk.
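The two-root, copy-on-write scheme in the excerpt can be modeled with a toy in-memory "disk". This is a rough sketch, not the Auragen implementation: the class `CowFileSystem` and its methods are hypothetical, one block holds one whole file, and flipping the `valid` index stands in for sending the sync message to the backup.

```python
class CowFileSystem:
    """Toy model: two on-disk roots, copy-on-write between syncs."""

    def __init__(self, num_blocks):
        self.num_blocks = num_blocks
        self.blocks = {}                     # block number -> data ("the disk")
        self.free = list(range(num_blocks))  # blocks that may be allocated
        self.semi_free = []                  # freed since the last sync, not yet reusable
        self.disk_roots = [{}, {}]           # the two on-disk root nodes
        self.valid = 0                       # root the backup was told to use
        self.root = {}                       # in-memory working copy of the valid root

    def write(self, name, data):
        new_block = self.free.pop(0)         # never overwrite a live block
        self.blocks[new_block] = data
        old_block = self.root.get(name)
        if old_block is not None:
            self.semi_free.append(old_block) # old data survives until the next sync
        self.root[name] = new_block

    def read(self, name):
        return self.blocks[self.root[name]]

    def sync(self):
        alternate = 1 - self.valid
        self.disk_roots[alternate] = dict(self.root)  # write root to the alternate slot
        self.valid = alternate                        # stands in for the sync message
        self.free.extend(self.semi_free)              # only now may old blocks be reused
        self.semi_free.clear()

    def recover(self):
        """Simulate a crash: restart from the valid on-disk root."""
        self.root = dict(self.disk_roots[self.valid])
        in_use = set(self.root.values())
        # Blocks touched since the last sync simply look free again.
        self.free = [b for b in range(self.num_blocks) if b not in in_use]
        self.semi_free = []


fs = CowFileSystem(num_blocks=8)
fs.write("a", "v1")
fs.sync()
fs.write("a", "v2")          # lands in a new block; "v1" is only semi-free
before_crash = fs.read("a")
fs.recover()                 # crash between syncs
after_crash = fs.read("a")
print(before_crash, after_crash)  # v2 v1
```

The point of the semi-free list shows up in `recover`: because the block holding "v1" could not be reallocated before the next sync, the file system rooted at the valid on-disk root is still intact.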

MiFID II, GPS and UTC time

I have a post up on the FSMLabs web site about the use of GPS and other satellite time for MiFID II timestamp compliance. It’s fascinating how much effort has recently gone into trying to convince people that MiFID II will require time taken directly from a national lab, or certified via a national lab, despite the clear wording in the proposed MiFID II regulations. To me, the deal is sealed in the Cost Benefit Analysis, in which the ESMA regulators write:

“The final draft RTS also reduces costs of the initial draft RTS proposed in the CP by allowing UTC disseminated via satellite systems (i.e. GPS receiver or the use of other satellite systems when available)”

That is not a promise one can easily walk away from. ESMA justifies the regulations with a cost/benefit analysis in which the costs for time stamping are limited by license to use GPS time.  Of course, legal reasoning and logic are not always the same, but I’m trying to figure out how ESMA regulators could claim that they didn’t mean it, or why they would have such a motivation.


“South Sea Bubble” by Edward Matthew Ward via Wikimedia

MiFID2 and security – keeping track of the money


A shorter version of this post is on the  FSMLabs web site.  MiFID2 is a new set of regulations for the financial services industry in Europe that includes a much more rigorous approach to timestamps.  Timestamps are in many ways the foundation for data integrity in modern processing systems – which are distributed, high speed, and generally gigantic. But when regulations or business or other constraints require timestamps to really work, the issues of fault tolerance and security come up. It doesn’t matter how precise your time distribution is if a mistake or a hacker can easily turn it off or control it.

TimeKeeper incorporates a defense-in-depth design to protect it from deliberate security attacks and errors due to equipment failure or misconfiguration. This engineering approach was born out of a conviction that precise time synchronization would become a business and regulatory imperative.

  1. Recent disclosures of still more security problems in the NTPd implementation of NTP show how vulnerable time synchronization can be without proper attention to security. PTPd and related implementations of the PTP standard have similar vulnerabilities.
  2. Security and general failure tolerance should be on the minds of firms that are considering how to comply with the MiFID2 rules because time synchronization provides both a broad attack surface and a single point of failure unless properly implemented.

The first step towards non-naive time synchronization is a skeptical attitude on the part of IT managers and developers. Ask the right questions at acquisition and design time to prevent unpleasant surprises later.

One of the most dangerous aspects of the just disclosed NTPd exploit is that NTPd will accept a message from any random source telling it to stop synchronizing with its actual time sources. Remember, NTPd is an implementation of NTP; other implementations may not suffer from the same flaw. That d is easy to overlook, but it’s key. TimeKeeper’s NTP and PTP implementations will, for example, ignore commands that do not come from the associated time source and will apply analytical skepticism to commands that do appear to come from the source. TimeKeeper dismisses many of these types of attacks immediately and will start throwing off alerts to provoke automated and human counter-measures. The strongest protection TimeKeeper offers, however, comes from its multi-source capabilities that allow it to compare multiple time sources in real-time and reject a primary source that has strayed.

Correct time travels a long, complex path from a source such as a GPS receiver or a feed like the one British Telecom is now providing. Among the questions system designers need to ask are the following two.

  1. Is the chain between source and client safeguarded comprehensively and instrumented end-to-end?
  2. Is there a way of cross-checking sources against other sources and rejecting bad sources?

Without positive answers to both of these questions, the time distribution technology is inherently fragile and robust MiFID2 timestamp compliance will be unavailable.
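As a rough illustration of the second question, cross-checking sources and rejecting a bad one, here is a minimal Python sketch that compares each source's offset against the median of all sources. This is not TimeKeeper's algorithm, which is not described here; the function names, source names, offsets, and tolerance are all made up for the example.

```python
from statistics import median


def cross_check(offsets_ns, tolerance_ns):
    """Split sources into trusted and rejected by distance from the median.

    offsets_ns maps a source name to its measured offset from the local clock.
    """
    mid = median(offsets_ns.values())
    trusted = {name for name, off in offsets_ns.items()
               if abs(off - mid) <= tolerance_ns}
    rejected = set(offsets_ns) - trusted
    return trusted, rejected


def pick_source(preferred_order, offsets_ns, tolerance_ns):
    """Return the first preferred source that agrees with the others, else None."""
    trusted, _ = cross_check(offsets_ns, tolerance_ns)
    for name in preferred_order:
        if name in trusted:
            return name
    return None


# Hypothetical readings: the primary GPS source has strayed by 2 ms.
offsets = {"gps": 2_000_000, "ptp_feed": 150, "ntp_backup": -300}
trusted, rejected = cross_check(offsets, tolerance_ns=1_000)
chosen = pick_source(["gps", "ptp_feed", "ntp_backup"], offsets, tolerance_ns=1_000)
print(sorted(rejected), chosen)  # ['gps'] ptp_feed
```

A median needs at least three sources to outvote one bad one, which is part of why a single-source design cannot answer the second question positively.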

The painting is: “Quentin Massys 001” by Quentin Matsys (1456/1466–1530) – The Yorck Project: 10.000 Meisterwerke der Malerei. DVD-ROM, 2002. ISBN 3936122202. Distributed by DIRECTMEDIA Publishing GmbH. Licensed under Public Domain via Commons.

Annals of unintentional irony

Came back after a trip to see Marc Andreessen’s team of Twitter posters complaining wryly about how government is so gosh durn big and citing experts like Milton Friedman.

That is, Marc Andreessen, the former National Center for Supercomputing Applications programmer who helped write the Mosaic Web Browser for the WWW that CERN scientists developed on top of the Internet technology that DARPA and NSF researchers and bureaucrats created on top of prior government funded technology like integrated circuits, and then built a company that made him rich with the team that worked together at NCSA, engaging in the typical rich person bitching about the gummint. Which shows how little reality matters to people.

Ikea and RedHat

This is state of the art for systems software now – which is not all that impressive.

Glantz explained that Ikea has more than 3,500 Red Hat Enterprise Linux (RHEL) servers deployed in Sweden and around the world. With Shellshock, every single one of those servers needed to be patched and updated to limit the risk of exploitation. So how did Ikea patch all those servers? Glantz showed a simple one-line Linux command and then jokingly walked away from the podium stating “That’s it, thanks for coming,” as the audience erupted into boisterous applause. On a more serious note, Glantz said that it took approximately 2.5 hours to test, deploy and upgrade Ikea’s entire IT infrastructure to defend against Shellshock.  Eweek.

Gilbert and Sullivan’s innovative business model

From the Financial Times: (and you should buy a subscription)

Piracy is a problem as old as the music industry itself. In Victorian times, it was illicitly copied sheet music that was the avowed enemy of the artist, and the operetta team Gilbert and Sullivan paid toughs to go round London pubs smashing up pianos with sledge hammers whenever they found bootlegged scores.



Space-X patents

Anderson: So what have all your creative people come up with, then? What’s different in your basic technology versus 50 years ago?

Musk: I can’t tell you much. We have essentially no patents in SpaceX. Our primary long-term competition is in China—if we published patents, it would be farcical, because the Chinese would just use them as a recipe book. [Wired]

This is probably a smart idea, but it illustrates the advantages of a working patent system. The inventions and advances that Space-X develops are kept secret. Engineers and scientists around the world can’t look at what they did, think of alternatives or better processes, or license technology and add new innovations on top of it. Without a working patent system, innovators have to obscure what they discover and, as Musk does, say very little. This slows down the progress of science.

Many of the critics of the patent system have a peculiar idea that there is some powerful advantage to being the first to market for a new idea. There is not. If Space-X gave its recipes away, Chinese and European companies would copy them and cut into its market. Maybe eventually US companies would do the same thing (or put flags on something made in China and mark it up). Anyone who thinks that Space-X could successfully sell rockets that were equivalent to or even not a lot better than those being sold by Lockheed has no idea how markets work. There is a nice sounding myth about how “agile” and “innovative” producers will by some magic be able to outcompete larger companies that copy their work and have far greater marketing, production, and distribution systems (and better political connections). That’s not how hardball works.





How business works in the real world: Apple and GT and Beelzebub

GT Advanced declared bankruptcy and blamed Apple for its problems. Apple called GT Advanced’s story “defamatory”.  I have no idea about the specifics in this case but I do know about  big companies pushing insanely onerous and self-defeating terms on small ones.  Here’s the original claim by GT Advanced:

At the start of negotiations, Apple offered to buy 2,600 sapphire growing furnaces from GT Advanced, which GT Advanced would operate on behalf of Apple, the “ultimate technology client to land,” according to Squiller.

“In hindsight, it is unclear whether Apple even intended to purchase any sapphire furnaces from GTAT,” he wrote.

But after months of hard negotiating, Apple offered a deal under which it would shift away economic risk by lending GT Advanced the money to build the furnaces and grow the sapphire, and then sell it exclusively to Apple for less than market value, Squiller wrote.

GT Advanced was effectively forced to accept the unfair deal in October 2013 because its intense negotiations with Apple had left it unable to pursue deals with other smartphone makers, he said.

Back when we were selling a real-time OS, I contacted an ex-boss who now had a high position in a telecommunications company to see if he could help us sell into it. His response was “don’t touch this place, it is expert in destroying small vendors.” And, whatever the actual story with GT and Apple, the storyline is not at all unusual. The elements of a smaller company spellbound by prospects of a huge deal/giant customer, followed by time consuming negotiations, followed by onerous demands – been there, done that. We once were negotiating with a huge semiconductor company about a big deal that, over time, got worse and worse for us. We “finalized” with some terms that we thought might be survivable. And then the semiconductor company negotiators, one of whom by this point our negotiators were privately calling “Beelzebub”, announced they had to take the deal “to management” and came back with much more absurd demands. We were able to walk away but we saw other small companies make deals with Beelzebub and then fail. There is a strong impulse in some big companies, among some business units, to squeeze small company vendors way too far.

Once, after a sales visit to a big Wall Street Firm, two experienced sales people for a second supplier told me they were shocked by what I had said to the customer and advised me never to make the same error. What was my mistake? I had told the customer that we had produced a product that worked a lot better than what they were currently using, cost them a lot less, and was highly profitable to us. “Never do that”, counseled my colleagues, “they want to believe that, at best, you are breaking even on the deal, otherwise they think they left money on the table.”

The absurdity of these kinds of “negotiations” is that they are highly unprofitable for big companies. The potential savings are usually negligible for the bigger company. Negotiating purchase agreements is expensive for big companies. If they are even in the position of dealing with a small vendor, there must be some compelling business reason for getting whatever the small vendor is selling. This also means that a deal that would be unprofitable, perhaps damaging, to the vendor would put a critical part of the supply chain at risk for the purchaser. But instead of closing the deal and moving on, some purchasing groups in some companies want to grind the vendor down to “show value” to their management or maybe even just out of habit. Sometimes this requires the smaller company to walk away, sometimes to go over the heads of the purchasing group to business units waiting for the product. We’ve probably lost some deals that would have worked out well in the end by refusing to accept onerous terms, but we’ve also walked away from deals that would have been fatally unprofitable. The good thing is that walking away from an overbearing big customer is usually a good starting point for a sales effort aimed at its competitors.