Telecom is 10 years behind Wall Street on time synchronization

This is an interesting paper, but telecom has not yet come to grips with the problems and advantages of fast, shared, commodity Ethernet interconnect.

North American service providers are in the process of upgrading their radio access networks with next generation LTE equipment. They are finalizing a 4G rollout that involves highly stringent timing requirements, but in many cases they are relying on sole-source synchronization by using Global Navigation Satellite System (GNSS). Naturally occurring disturbances, as well as unintentional radio frequency jamming, intentional jamming, and spoofing, make GNSS vulnerable to interference.

This article presents a novel approach for addressing the issue of GNSS vulnerability by introducing a standard means of providing a redundant packet-based synchronization source for LTE base stations. It also describes how this new approach can mitigate noise caused by asymmetry and transit delay variation in packet networks.

commag-pearson-rev2
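
To see why asymmetry matters for packet-based synchronization, here is a minimal sketch of the standard two-way timestamp exchange that PTP and NTP both rely on. It is not from the paper; the function names and the example delays are assumptions made up for illustration.

```python
def two_way_estimate(t1, t2, t3, t4):
    """Estimate clock offset and one-way delay from a two-way exchange.

    t1: master sends (master clock)    t2: slave receives (slave clock)
    t3: slave sends (slave clock)      t4: master receives (master clock)
    The estimate assumes the forward and reverse path delays are equal.
    """
    offset = ((t2 - t1) - (t4 - t3)) / 2.0
    delay = ((t2 - t1) + (t4 - t3)) / 2.0
    return offset, delay


def simulate_exchange(true_offset, delay_ms, delay_sm, t1=0.0, turnaround=1.0):
    """Produce the four timestamps for hypothetical path delays.

    delay_ms is the master-to-slave delay, delay_sm the slave-to-master delay.
    The slave clock reads master time plus true_offset.
    """
    t2 = t1 + delay_ms + true_offset
    t3 = t2 + turnaround
    t4 = (t3 - true_offset) + delay_sm
    return t1, t2, t3, t4


# Symmetric path: the estimate recovers the true offset.
symmetric = two_way_estimate(*simulate_exchange(5.0, 20.0, 20.0))
# Asymmetric path: the estimate is off by (delay_ms - delay_sm) / 2.
asymmetric = two_way_estimate(*simulate_exchange(5.0, 30.0, 10.0))
```

With a true offset of 5 and delays of 30 out and 10 back (any consistent time unit), the estimate comes out as 15. The error is half the difference between the two path delays, and nothing in the four timestamps reveals it.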

Project Roseline

Accurate and reliable knowledge of time is fundamental to cyber-physical systems for sensing, control, performance, and energy efficient integration of computing and communications. This simple statement underlies the RoseLine project. Emerging CPS [Cyber Physical Systems - vy] applications depend on precise knowledge of time to infer location and control communication. There is a diversity of semantics used to describe time, and quality of time varies as we move up and down the system stack. System designs tend to overcompensate for these uncertainties and the result is systems that may be over designed, inefficient, and fragile. The intellectual merit derives from the new and fundamental concept of time and the holistic measure of quality of time (QoT) that captures metrics including resolution, accuracy, and stability.

The project will build a system stack that enables new ways for clock hardware, OS, network services, and applications to learn, maintain and exchange information about time, influence component behavior, and robustly adapt to dynamic QoT requirements, as well as to benign and adversarial changes in operating conditions. Application areas that will benefit from Quality of Time will include: smart grid, networked and coordinated control of aerospace systems, underwater sensing, and industrial automation. The broader impact of the proposal is due to the foundational nature of the work which builds a robust and tunable quality of time that can be applied across a broad spectrum of applications that pervade modern life. Roseline.

Why is computer science education so horrible

This is a post about CS education. It is prompted by a series of posts by Mark Guzdial in which he criticizes the pervasive belief among CS educators that when it comes to programming, there’s not much an instructor can do: some students get it, others don’t; it’s all in genetics or other external factors that CS educators can’t influence (aka the Geek Gene Hypothesis). I am with Mark on this, but I feel a bit stronger about it than he does. I think that attitude is bullshit, and the studies supporting it are unsound by means of making conjectures ignoring an enormous number of confounding factors. – Cristina (Crista) Videira Lopes

A lot of it is not even teaching method; it is attitude. When CS was “professionalized” in the 1970s, all the stupid gatekeeper ambitions and pointless competition from the traditional “hard sciences” and engineering were imported. When I taught intro CS, I found that the most important part of my job was sustaining the morale of the students who came in without programming background (and many bad ideas).

Harvey Mudd, as Videira points out, has actually tried to address these problems:

Our instructors had private conversations with students who were using up a disproportionate amount of air time in class talking about arcane details and asked them to have those conversations with the instructors in private because other students found their level of knowledge intimidating. [US News]

Not being as nice, smart, or hardworking as the people at Harvey Mudd, I used to just tell those students to shut up and learn something in my class. And I spent a lot of time telling other students not to worry, that the “experts” didn’t actually know anything near what they thought they did. The common attitude among CS teachers, who are generally former arcane-air-time students themselves, is quite different.

JP Morgan Security Breach

“Faced with the rising threat of online crime, JPMorgan has said it plans to spend $250 million on digital security annually, but had been losing many of its security staff to other banks over the last year, with others expected to leave soon.” – New York Times. 

Why is it that NY banks can spend so much money on computing infrastructure and find themselves short of actual talent so frequently? There are very sharp people working in finance technology, but …

Paths versus Recursion

/*
 * Iterative DepthFirst file list (c) Victor Yodaiken 2013
 *
 * "Not the way we do it in Brooklyn" - Dave "Kinch" Arnow. 
 *
 *
 * Data structure is P - the current path, with some aux data
 *
 * Two basic operations:
 * 1) Lp(P) - starts at path P and extends it to the leftmost
 * reachable file/directory
 * 2) IterateDF(P) iterates by advancing a path to the next in depth 
 * first order
 *
 * So program is
 *
 * Initialize(P);
 * do{
 * PrintPath(P);
 * }while(IterateDF(P) != EOF)
 *
 *
 * Horribly inefficient - can be cured by caching positions in directory
 * entries.
 *
 * */
Link to Code
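
For readers without the linked code, here is a rough Python sketch of the scheme the header comment describes. It is not the original program: the function names (children, leftmost, iterate_df, list_files) are assumptions, and so is the choice to emit a directory after its contents.

```python
import os


def children(path):
    """Sorted child names of a directory; empty for files, symlinks, unreadable dirs."""
    if os.path.isdir(path) and not os.path.islink(path):
        try:
            return sorted(os.listdir(path))
        except OSError:
            return []
    return []


def leftmost(path):
    """Lp(P): extend the path through first children until a leaf is reached."""
    while True:
        kids = children(path)
        if not kids:
            return path
        path = os.path.join(path, kids[0])


def iterate_df(root, path):
    """Advance the path to the next entry in depth-first order, or None at the end."""
    if path == root:
        return None
    parent, name = os.path.split(path)
    # Re-reading the parent directory on every step is the inefficiency
    # the header comment mentions; caching positions would avoid it.
    siblings = children(parent)
    i = siblings.index(name)
    if i + 1 < len(siblings):
        return leftmost(os.path.join(parent, siblings[i + 1]))
    return parent


def list_files(root):
    """Collect every path under root without recursion or an explicit stack."""
    root = os.path.normpath(root)
    out = []
    path = leftmost(root)
    while path is not None:
        out.append(path)
        path = iterate_df(root, path)
    return out
```

The only state carried between steps is the current path itself, which is the point of the exercise. The sketch assumes the tree does not change while it is being walked.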

What happens when you do not use enterprise quality technology.

News from last year.
Aug. 26, 2013 11:52 a.m. ET

A glitch in time synchronization caused German exchange operator Deutsche Boerse AG’s Eurex Exchange arm to halt trading for slightly more than an hour early Monday, the latest in a thread of technical issues at global exchanges.

The Eurex derivatives trading market was halted at 0620 GMT, 20 minutes after trading started, “in order to protect the integrity of the market,” Deutsche Boerse said.

“An incorrect time synchronization within the system” triggered the market halt. The problem was solved, pre-trading started at 0720 GMT and, as of 0730 GMT all products were again tradable, the exchange said.

A person close to the matter said the glitch was caused by a problem with the system clock, not extreme data load, noting that current trading volumes are far below previous peaks.

Wall Street Journal

From what we understand, the problem was due to a number of GPS time servers that dropped leap second adjustments, jumping the time back 36 seconds. The PTP2 software then adjusted server time to jump backward 36 seconds as well. It’s not the first failure of this sort (or the last). The legacy technology for time synchronization lacks the cross check, failover, and alerting that are built into TimeKeeper. That technology was not designed with the enterprise in mind, and although it has been heavily modified over time, it is very difficult to engineer a solution to a basic architectural mismatch.

For example, TimeKeeper is architected to treat all time sources the same, but software like PTPd and NTPd is designed around a specific protocol. So TimeKeeper can use a feed from one protocol to crosscheck another, while doing the same in one of the protocol-specific time synchronization programs requires a clumsy grafting process.

As another example, the PTP standard has a completely inappropriate fault-tolerance technique baked in, a holdover from the standard’s origin in device control. The standard was designed for systems with really simple networks and time consumers. A single cable with some welding machines on the end is perhaps a typical intended use case. The idea was that the time sources would advertise how good they were and the consumers would simply pick the one that said it was the best. This is an absurd policy for an enterprise network with super-powerful server computers receiving time packets across complex networks dotted with routers and switches. TimeKeeper was designed to ignore this “best master” protocol for fault tolerance and to utilize the smarts of the consumers to detect problems in the feed and to select among alternatives.
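
As a rough illustration of what cross checking buys you, here is a minimal Python sketch. It is not TimeKeeper’s algorithm, which this post does not describe; the median test, the source names, and the 1 ms tolerance are all assumptions.

```python
from statistics import median


def cross_check(offsets, tolerance):
    """Split sources into agreeing and suspect sets by distance from the median.

    offsets maps a source name (any protocol) to the offset, in seconds,
    that the source says should be applied to the local clock.
    """
    mid = median(offsets.values())
    good = {name: off for name, off in offsets.items()
            if abs(off - mid) <= tolerance}
    suspect = sorted(name for name in offsets if name not in good)
    return good, suspect


def choose_correction(offsets, tolerance=0.001):
    """Return (correction, suspects); correction is None if no sources agree."""
    good, suspect = cross_check(offsets, tolerance)
    if not good:
        return None, suspect
    return median(good.values()), suspect


# Hypothetical feeds: one GPS-backed PTP source has jumped back 36 seconds.
feeds = {"ptp_gps_a": -36.0, "ptp_gps_b": 0.00002,
         "ntp_1": 0.00005, "ntp_2": -0.00001}
correction, suspects = choose_correction(feeds)
```

Here the consumer flags ptp_gps_a and keeps following the feeds that agree, instead of trusting whichever source advertises itself as best. A sketch like this only helps while most sources are healthy: if a majority of feeds jump together, the median follows them, which is why independent sources matter.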

 

 

Fischer Lynch Paterson and timeouts

There is a widely cited (over 1400 cites in CiteSeerX) result called the Fischer-Lynch-Paterson theorem about consensus – a key issue in distributed databases or any system where data is distributed, replicated, or both.

In this paper, we show the surprising result that no completely asynchronous
consensus protocol can tolerate even a single unannounced process death. We do
not consider Byzantine failures, and we assume that the message system is reliable – it
delivers all messages correctly and exactly once. Nevertheless, even with these
assumptions, the stopping of a single process at an inopportune time can cause any
distributed commit protocol to fail to reach agreement. Thus, this important
problem has no robust solution without further assumptions about the computing
environment or still greater restrictions on the kind of failures to be tolerated!
Crucial to our proof is that processing is completely asynchronous; that is, we
make no assumptions about the relative speeds of processes or about the delay
time in delivering a message. We also assume that processes do not have access to
synchronized clocks, so algorithms based on time-outs, for example, cannot be
used. (In particular, the solutions in [6] are not applicable.) Finally, we do not
postulate the ability to detect the death of a process, so it is impossible for one
process to tell whether another has died (stopped entirely) or is just running very
slowly.

The Wikipedia summary is similar:

In a fully asynchronous system there is no consensus solution that can tolerate one or more crash failures even when only requiring the non triviality property.[11] This result is sometimes called the FLP impossibility proof. The authors Michael J. Fischer, Nancy Lynch, and Mike Paterson were awarded a Dijkstra Prize for this significant work. The FLP result does not state that consensus can never be reached: merely that under the model’s assumptions, no algorithm can always reach consensus in bounded time. In practice it is highly unlikely to occur.

As far as I can tell the “surprising” result is that unbounded delays are not bounded. Without timeouts, we cannot distinguish between a site B that will eventually send a message to site A and a site B that has failed (crashed) and will never send a message.
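
To make the point concrete, here is a minimal sketch of a timeout-based failure detector in Python. The class and method names are assumptions for illustration, and times are passed in explicitly so the example is deterministic.

```python
class TimeoutDetector:
    """Suspect a site if no message has arrived within `timeout` seconds."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_heard = {}

    def heard_from(self, site, now):
        """Record that a message from `site` arrived at time `now`."""
        self.last_heard[site] = now

    def suspects(self, now):
        """Sites whose last message is older than the timeout."""
        return sorted(site for site, t in self.last_heard.items()
                      if now - t > self.timeout)


# Site A watches site B with a 2 second timeout.
a = TimeoutDetector(timeout=2.0)
a.heard_from("B", now=0.0)
early = a.suspects(now=1.0)    # B is still within the timeout
late = a.suspects(now=3.0)     # B is now suspected: crashed, or just slow?
a.heard_from("B", now=3.5)     # a late message shows B was only slow
after = a.suspects(now=4.0)
```

Site A suspects B once the timeout passes, but the late message shows the suspicion was only a guess. The timeout is exactly the extra assumption about delay that the asynchronous model rules out.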

 

 

Why is clock synchronization so important in big data

Distributed transactions have historically been implemented by the database community in the manner pioneered by the architects of System R* [22] in the 1980s. The primary mechanism by which System R*-style distributed transactions impede throughput and extend latency is the requirement of an agreement protocol between all participating machines at commit time to ensure atomicity and durability. To ensure isolation, all of a transaction’s locks must be held for the full duration of this agreement protocol, which is typically two-phase commit.

The problem with holding locks during the agreement protocol is that two-phase commit requires multiple network round-trips between all participating machines, and therefore the time required to run the protocol can often be considerably greater than the time required to execute all local transaction logic. If a few popularly accessed records are frequently involved in distributed transactions, the resulting extra time that locks are held on these records can have an extremely deleterious effect on overall transactional throughput. [Calvin: Fast Distributed Transactions for Partitioned Database Systems. Alexander Thomson et al.]
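
A back-of-the-envelope sketch of the cost described above, in Python. The numbers and the function name are assumptions for illustration, and the model is deliberately crude: it assumes every transaction touching the hot record takes an exclusive lock, so those transactions run one at a time.

```python
def hot_record_tps(local_exec_s, round_trips, rtt_s):
    """Upper bound on transactions per second through one exclusively locked record.

    The lock is held for the local work plus every network round trip
    of the agreement protocol.
    """
    hold_time = local_exec_s + round_trips * rtt_s
    return 1.0 / hold_time


# Hypothetical figures: 0.1 ms of local logic, 0.5 ms network round trip.
local_only = hot_record_tps(0.0001, 0, 0.0005)     # no agreement protocol
with_commit = hot_record_tps(0.0001, 2, 0.0005)    # two round trips under lock
```

With these made-up figures the ceiling on the hot record drops from roughly 10,000 transactions per second to about 900, even though the local work has not changed.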

This is one reason why we have all the new databases without transaction support: high transaction rates in a distributed environment (e.g. web or IoT applications in the Cloud) cannot be scaled in the face of lock overhead.

 

Patents considered harmful, by some.

That’s not how it looks from here, but I think part of the muddiness in the software patent argument is the result of arguments that really attack the entire idea of patents but are advanced as being specific to software patents. The whole idea of a patent is that you do foster innovation and competition by “handing out monopolies”. Maybe that idea is wrong, but if you think it is wrong you are likely opposed to patents – period.

Famous anti-innovation patenter.

Tim Lee and many of the other people who dislike software patents confuse the issue by simultaneously claiming that (1) software patents are inherently worse/different than other patents and (2) software patents choke off innovation because of properties that turn out to apply to all patents. Opposing patents in general, however, is a more radical proposition than opposing software patents – perhaps a more radical proposition than people feel comfortable making.

My position is that many patents are wrongly granted for “inventions” that are neither novel nor non-obvious and that the system for adjudicating patents is way too slow, error prone, and expensive. But patents themselves serve a useful purpose. The obvious example is Excel, which is effectively a monopoly without the benefit of any patents at all. The work of the spreadsheet innovators was, without patent protection, rapidly copied by companies with better marketing and greater resources. Innovation stopped. End of story.

And “business method” patents, in general, are not really software or computer patents at all. Usually they are efforts to patent a well known method of doing business adding some technology to the mix to buttress a claim of novelty. One could have similarly claimed a hundred years ago that making sales calls via telephone was an invention or that delivering adverts by TV instead of by radio was an invention.

 

A claimed validated operating system.

The claim: we have demonstrated the comprehensive formal verification of the seL4 microkernel, with a complete proof chain from precise, formal statements of high-level security and safety properties to the  binary executable code. GD

The L4 base is useful –  we advocated a similar approach with RTLinux which was, um, very similar to L4. It looks like the L4 version here uses the interrupt emulation method at the heart of RTLinux (of course, without any attribution or reference). Just for the record, here’s a much earlier RTLinux based effort.

As for the verification, I am highly skeptical of the claim. Here’s the claimed proof and the research paper. It’s not at all clear exactly what was validated, but from the paper it looks like the 8,000-odd lines of the L4 microkernel were shown to provide the functionality described in the Hoare logic specification. No device drivers appear to have been validated.