What happens when you do not use enterprise quality technology.

News from last year.
Aug. 26, 2013 11:5primary_SafetyLastStillClock2 a.m. ET

A glitch in time synchronization caused German exchange operator Deutsche BoerseAG DB1.XE -0.16% ‘s Eurex Exchange arm to halt trading for slightly more than an hour early Monday, the latest in a thread of technical issues at global exchanges.

The Eurex derivatives trading market was halted at 0620 GMT, 20 minutes after trading started, “in order to protect the integrity of the market,” Deutsche Boerse said.

“An incorrect time synchronization within the system” triggered the market halt. The problem was solved, pre-trading started at 0720 GMT and, as of 0730 GMT all products were again tradable, the exchange said.

A person close to the matter said the glitch was caused by a problem with the system clock, not extreme data load, noting that current trading volumes are far below previous peaks.

Wall Street Journal

From what we understand, the problem was due to a number of GPS time servers that dropped leap year adjustments, so jumping the time back 36 seconds. Then PTP2 software adjusted server time to jump backward 36 seconds. It’s not the first failure of this sort (or the last). The legacy technology for time synchronization lacks the cross check, failover, and alerting that are built into TimeKeeper. That technology was not designed with enterprise in mind and although it has been heavily modified over time, it is very difficult to engineer a solution to basic architectural mismatch. For example, TimeKeeper is architected to treat all time sources the same, but software like PTPd and NTPd is designed for a specific protocol. So TimeKeeper can use a feed from one protocol to crosscheck another, but that requires a clumsy grafting process in one of the protocol specific time synchronization programs.  As another example, the PTP standard has a completely inappropriate fault-tolerance technique baked into the standard – a technique that is a holdover from the PTP standard origin in device control. The standard was designed for systems with really simple networks and time consumers. A single cable with some welding machines on the end is perhaps a typical intended use case. The idea was that the time sources would advertise how good they were and the consumers would simply pick the one that said it was the best. This is an absurd policy for an enterprise network with super-powerful server computers receiving time packets across complex networks dotted with routers and switches. TimeKeeper was designed to ignore this “best master” protocol for fault tolerance and to utilize the smarts of the consumers to detect problems in the feed and to select among alternatives.

 

 

Fischer Lynch Patterson and timeouts

There is a widely cited (over 1400 cites in CiteseerX ) result called the Fischer-Lynch-Patterson text_message_from_godot-469763theorem about consensus – a key issue in distributed databases or any system where data is either distributed or replicated or both.

In this paper, we show the surprising result that no completely asynchronous
consensus protocol can tolerate even a single unannounced process death. We do
not consider Byzantine failures, and we assume that the message system is reliable it
delivers all messages correctly and exactly once. Nevertheless, even with these
assumptions, the stopping of a single process at an inopportune time can cause any
distributed commit protocol to fail to reach agreement. Thus, this important
problem has no robust solution without further assumptions about the computing
environment or still greater restrictions on the kind of failures to be tolerated!
Crucial to our proof is that processing is completely asynchronous; that is, we
make no assumptions about the relative speeds of processes or about the delay
time in delivering a message. We also assume that processes do not have access to
synchronized clocks, so algorithms based on time-outs, for example, cannot be
used. (In particular, the solutions in [6] are not applicable.) Finally, we do not
postulate the ability to detect the death of a process, so it is impossible for one
process to tell whether another has died (stopped entirely) or is just running very
slowly.

The Wikipedia summary is similar:

In a fully asynchronous system there is no consensus solution that can tolerate one or more crash failures even when only requiring the non triviality property.[11] This result is sometimes called the FLP impossibility proof. The authors Michael J. Fischer, Nancy Lynch, and Mike Paterson were awarded a Dijkstra Prize for this significant work. The FLP result does not state that consensus can never be reached: merely that under the model’s assumptions, no algorithm can always reach consensus in bounded time. In practice it is highly unlikely to occur.

As far as I can tell the “surprising” result is that unbounded delays are not bounded. Without timeouts, we can not distinguish between a site B that will eventually send a message to site A and a site B that has failed (crashed) and will never send a message.

 

 

Why is clock synchronization so important in big data

Distributed transactions have historically been implemented by the database community in the manner pioneered by the architects of System R* [22] in the 1980s. The primary mechanism by which System R*-style distributed transactions impede throughput and extend latency is the requirement of an agreement protocol between all participating machines at commit time to ensure atomicity and durability. To ensure isolation, all of a transaction’s locks must be held for the full duration of this agreement protocol, which is typically two-phase commit.

The problem with holding locks during the agreement protocol is that two-phase commit requires multiple network round-trips between all participating machines, and therefore the time required to run the protocol can often be considerably greater than the time required to execute all local transaction logic. If a few popularly accessed records are frequently involved in distributed transactions, the resulting extra time that locks are held on these records can have an extremely deleterious effect on overall transactional throughput. [ Calvin: Fast Distributed Transactions for Partitioned Database Systems. Alexander Thomson et al]

This is one reason why we have all the new databases without transaction support – because high transaction rates in a distributed environment (e.g. web or IOT applications in the Cloud) cannot be scaled in face of lock overhead.

 

Patents considered harmful, by some.

That’s not how it looks from here but I think part of the muddiness in the software patent argument is the result of arguments that really attack the entire idea of patents but are advanced as being specific to software patents. The whole idea of a patent is that you do foster innovation and competition by “handing out monopolies”. Maybe that idea is wrong, but if you think it is wrong you are likely opposed to patents – period.  Tim Lee and many of the other

Famous anti-innovation patenter.

Famous anti-innovation patenter.

people who dislike software patents confuse the issue by simultaneously claiming that (1) software patents are inherently worse/different than other patents and (2) that software patents choke off innovation because of properties that turn out to apply to all patents. Opposing patents in general, however, is a more radical proposition than opposing software patents – perhaps more a more radical proposition than people feel comfortable making.

My position is that many patents are wrongly granted for “inventions” that are neither novel nor un-obvious and that the system for adjudicating patents is way too slow, error prone, and expensive. But patents themselves serve a useful purpose.  The obvious example is Excel which is effectively a monopoly without the benefit of any patents at all. The work of the innovators was, without patent protection, rapidly copied by companies with better marketing and greater resources. Innovation stopped. End of story.

And “business method” patents, in general, are not really software or computer patents at all. Usually they are efforts to patent a well known method of doing business adding some technology to the mix to buttress a claim of novelty. One could have similarly claimed a hundred years ago that making sales calls via telephone was an invention or that delivering adverts by TV instead of by radio was an invention.

 

A claimed validated operating system.

The claim: we have demonstrated the comprehensive formal verification of the seL4 microkernel, with a complete proof chain from precise, formal statements of high-level security and safety properties to the  binary executable code. GD

The L4 base is useful –  we advocated a similar approach with RTLinux which was, um, very similar to L4. It looks like the L4 version here uses the interrupt emulation method at the heart of RTLinux (of course, without any attribution or reference). Just for the record, here’s a much earlier RTLinux based effort.

As for the verification, I am highly skeptical of the claim. Here’s the claimed proof and the research paper.  It’s not at all clear exactly what was validated, but from the paper it looks like the 8000 odd lines of l4 microkernel were shown to provide the functionality described in the Hoare logic specification. No device drivers appear to have been validated.

Cassandra

kly and kass

Cassandra’s unfortunate end.

Cassandra is quite interesting – and time sync seems increasingly critical to correct operation. Here are some resources:

A paper on the storage model from developers at Facebook.

A reasonably clear introduction from IBM Developer Works.

A big difficulty for people who are, like me, more familiar with the relational databases is that Cassandra and similar DBs are concerned about a very different kind of computation/storage platform and IT environment (different kind of technical user/programmer).

 

A hard theorem becomes easy over time.

IMG_20140910_141117_570Can it really be this simple?

Suppose we have simple programming language where variables represent non-negative integers and where the range of integers represented is limited only by available memory. This language has the usual arithmetic operators and recursion.  We also need to have the ability to pass functions as parameters. If we define f by f(x){ x(2)}  then f(g) = g(2). A function passed as a parameter is just a number that might be a pointer to function code or  the code itself, it doesn’t matter here. Now suppose we put this language on a computer with an infinite memory so that the integer variables can represent arbitrary integers, no matter how large, and function execution  never “overflows the stack”. Functions written in this language and executed on this imaginary computer may get stuck in infinite loops or in computations that never succeed. For example, Forever(x){ if(x< x+1)then return Forever(x+1) else return 1} never terminates because the integers are an infinite set. A function Solve(f,n){ if( f(n) = 0)return n else return Solve(f,n+1) } will just keep trying larger and larger n’s forever if there is no solution for a particular f and initial n.  It would be convenient if we could define a function “Terminates” so that Terminates(f,x)=1 if f(x) terminates (returns a value ) and Terminates(f,x)=0 if f(x) gets stuck. Of course  Terminates itself should always return some value and never get stuck in a loop. But we can’t write Terminates and that’s easy to prove.  If we could write Terminates as specified we could define a function G(f){ return Terminates(f,f) } which just tests to see if f(f) terminates. Remember, functions are just numbers.  But we could also write a sneaky function like this:

Diagonal(f){ if( Terminates(f, f)=1) then return Forever(0) else return 1}

If Terminates(f,f) =1  then Diagonal(f) calls Forever(0) which never returns. If Terminates(f,f)=0 then Diagonal(f) returns 1. If Terminates does what it is supposed to do, Diagonal(f,f) terminates if and only if f(f) does not terminate. But Diagonal is itself a function so we can ask the system to evaluate Diagonal(Diagonal) and this leads to a paradox.  If Terminates(Diagonal,Diagonal)=1  then Diagonal(Diagonal) will call Forever(0) and will not terminate. That is, if Terminates tells us Diagonal(Diagonal) does terminate, then Diagonal(Diagonal) does not terminate.  The only way to resolve this paradox is to admit that Terminate cannot work as specified. And notice that this construction of Diagonal doesn’t use many complex language features – Diagonal just needs Terminates, recursion, if/else and the ability to treat code as data (functions as integers).

For mathematics, this result has all sorts of interesting implications. For example, if it were not true, we could, in principal, write a function that tested numbers to see if they were odd perfect numbers and marched up the positive integers until it found one. If that function could be evaluated by Terminates, we’d know if there was such a number – something nobody seems to know right now. Any theorem Exists x, P(x) where P can be represented in  our language could be tested by evaluating Terminates on a function that tests P on consecutive integers until it finds a solution. Any theorem Forall x P(x) could be evaluated by seeing if a program that looks for a counter-example ever terminates. All that should indicate that Terminates is implausible as well as impossible. For computer science the implications are more subtle because the fact that we don’t really have computers with infinite memory matters in computer science.

 

From Jersey to Wall Street – or the equivalent

cartaretA common configuration for FSMLabs TimeKeeper customers is to cross-couple time sources in New Jersey and New York City or London and Slough or Chicago and Aurora or Singapore and Sidney- any two trading locations that are connected with high quality network. Sometimes the network connection does not even have to be that great. TimeKeeper will cross-check time sources, complain when things look wrong, and failover when needed.  Multiple time sources also produces a timestamp provenance. Trading applications will have a record showing that the  timestamps they produce were in agreement with two or more independent sources. A number of firms scale this basic design to multiple sites: increasing the depth of fault-tolerance and the strength of the provenance. Cross-coupling time feeds also  provides early warning on a number of network problems. Several customers saw TimeKeeper warnings about secondary sync  and found on investigation that their network providers were changing equipment or rerouting without notice.

 

 

 

Real-time Linux

My opinion has always been that the Linux-RT project was based on an unfixable engineering error.

 

A few words on the status and the future of RT:
-----------------------------------------------

The situation since last years RTLWS (https://lwn.net/Articles/572740/)
has not improved at all, it's worse than before. 

While shortly after RTLWS quite some people promised to whip up proper
funding, nothing has materialized and my personal situation is worse
than before.

I'm really tired of all the politics involved, the blantant lies and
the marketing bullshit which I have to bear. I learned a few month ago
that a certain kernel vendor invented most of RT anyway and is the
expert in this field, so the customers dont have to worry about my
statements.

Just for the record: The initial preempt-RT technology was brought to
you mostly by Ingo Molnar, Steven Rostedt, Paul Mckenney, Peter
Zijlstra and myself with lots of input from Doug Niehaus, who
researched full in kernel preemption already in the 1990s. The
technology rewrite around 3.0-rt was done by me with help from Peter
and Steven, and that's what preempt-RT today is based on.

Sure, people can believe whatever marketing bullshit they want, but
that doesn't make the truth go away. And the truth is, that those who
claim expertise are just a lying bunch of leeches.

What really set me off was the recent blunt question, when I'm going
to quit. What does this mean? Is someone out there just waiting that I
step down as preempt-RT maintainer, so some corporate entity can step
up as the saviour of the Linux RT world? So instead of merily leeching
someone seeks active control over the project. Nice try.
http://lwn.net/Articles/604632/

The Horrors of Software Patents

US Patent Number  4701722 A is a perfect example of everything software patent opponents hate about software patents:

  • It implements  mathematical functions that are pretty well known.
  • It covers a process of changing information not changing any physical object one could touch.
  • It is owned by a company with a business model depending on licensing fees, not manufacturing things: What many people would call a “patent troll”

US 4701722 A is is one of the basic patents of Dolby B noise reduction awarded to Ray Dolby back in 1987. It’s not a software patent, it is a circuit patent.  The usual arguments about why software should not be subject to patenting make little sense. If a circuit can be patented, then it should be possible to patent a program. The beginning of the  study of computation as a mathematical subject was the discovery that a simple general purpose engine reading instructions from some sort of store can emulate any circuit. The early computers were “programmed” by physically moving wires around. The photo below shows ENIAC being programmed. A “program” then was obviously a circuit design. Technology advanced to that these circuits could be stored in memory and on disk drives. But that did not change the basic process – writing software is conceptually the same as wiring up an electrical circuit.

eniac

And around the same time Nyquist/Shannon showed that analog and digital signals could be equivalent to each other.  Ray Dolby knew, and noted in this patent, that the analog signals transformed by his invention could be transformed into digital information and transformed back – as convenient. If there is an intrinsic flaw in the concept of software patents themselves, something fundamentally distinct in software that makes software inventions impossible, then the critics of software patents have failed to explain what that could be. See also Martin Goetz which I found via the anti-software patent arguments (1 and 2) of Tim B. Lee (who is not Tim Berners-Lee).