The lost wakeup problem

One basic problem in operating system design is how to make blocking requests efficient. Thread A wants something from Thread B (or ISR B), so it (1) makes the request and (2) blocks. There are thousands of ways in which the interval between (1) and (2) can lead to a “missed wakeup”: A requests, B signals A before A has actually gone to sleep, and then A sleeps waiting for a second wakeup that never comes. This problem is why bus designers moved from “edge triggered” to “level triggered” interrupts, but it is even worse in software. If chip designers knew or cared anything about OS design, instead of wasting billions of transistors on ridiculous “features” that only make the software slower, they could easily give us some atomic latches for synchronization, but part of the fun of OS design is figuring out how to compensate for clumsy processor methods.

In a general purpose OS, these mechanisms can be enormously inefficient without making much of a difference. In fact, operating systems can be done in by creeping inefficiency: there are so many slow and inefficient parts that tolerances throughout the system keep relaxing. I used to have this argument all the time with some of our programmers, who claimed, correctly, that “since X introduces an N time unit worst case delay, designing Y to have less than N in this case will not have a visible effect.” The simple counterargument is “we’re going to fix X later”, but the real argument is that designing to the level of problems already in the system makes failure part of the spec. So as we design our new enterprise real-time OS, we are paying a lot of attention to fine details, and we’re trying to take advantage of lessons learned about how these things will be used.
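
To make the missed wakeup concrete, here is a minimal user-space sketch of the "level trigger" idea in software. It uses POSIX threads as a stand-in, not the mechanism of any particular kernel, and the names `struct latch`, `latch_signal`, `latch_wait` and `demo_signal_before_wait` are hypothetical, invented for this illustration. The point is only that the signal is recorded as state under a lock, so it does not matter whether B signals before or after A blocks.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical one-shot latch. The flag stores the event as a level,
 * so a signal that arrives before the waiter sleeps is not lost. */
struct latch {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    bool            signaled;
};

void latch_init(struct latch *l)
{
    pthread_mutex_init(&l->lock, NULL);
    pthread_cond_init(&l->cond, NULL);
    l->signaled = false;
}

void latch_destroy(struct latch *l)
{
    pthread_cond_destroy(&l->cond);
    pthread_mutex_destroy(&l->lock);
}

/* B's side: record the event first, then wake any sleeper. */
void latch_signal(struct latch *l)
{
    pthread_mutex_lock(&l->lock);
    l->signaled = true;
    pthread_cond_signal(&l->cond);
    pthread_mutex_unlock(&l->lock);
}

/* A's side: the check and the sleep happen under one lock, and
 * pthread_cond_wait releases the lock and blocks atomically.
 * A naive "if (!signaled) sleep();" without the lock leaves a
 * window between the test and the sleep where B's wakeup is lost. */
void latch_wait(struct latch *l)
{
    pthread_mutex_lock(&l->lock);
    while (!l->signaled)            /* loop also absorbs spurious wakeups */
        pthread_cond_wait(&l->cond, &l->lock);
    l->signaled = false;            /* consume the event */
    pthread_mutex_unlock(&l->lock);
}

/* Thread B: answers the request by signaling the latch. */
static void *responder(void *arg)
{
    latch_signal((struct latch *)arg);
    return NULL;
}

/* Forces the worst case: B signals and exits before A ever blocks.
 * Returns 0 once A has been released. */
int demo_signal_before_wait(void)
{
    struct latch l;
    pthread_t b;
    int rc;

    latch_init(&l);
    rc = pthread_create(&b, NULL, responder, &l);   /* (1) request */
    assert(rc == 0);
    (void)rc;
    pthread_join(b, NULL);      /* B has already signaled by now */
    latch_wait(&l);             /* (2) block: returns at once, flag is set */
    latch_destroy(&l);
    return 0;
}
```

With an edge-style wakeup the call to `latch_wait` in the demo would hang forever, because the only signal has already come and gone. Paying for a mutex and a condition variable is exactly the kind of compensation for missing hardware latches discussed above, and a real-time kernel would likely want something much cheaper.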