Notes on Live Scheduling [1]

I went back to graduate school in the last millennium hoping to find some levers for improving the process of building and architecting operating systems. A couple of years working on a fault-tolerant distributed OS made me unhappy about working without blueprints and without methods of doing even back-of-the-envelope design verification. Too much inspired guessing (hard for those of us whose inspiration is limited) and laborious rewriting was needed to compensate. There was never time to optimize because so much time was absorbed rewriting or discovering design errors by debugging or performance testing. A couple of years of studying “formal methods”, however, showed this not to be a particularly fruitful area of inquiry[2].

It turns out that the humble and well-known state machine is a pretty good basis for reasoning about operating systems and hardware. I have found compact ways of representing state machines by recursive functions on strings and of composing state machines in a product that allows arbitrary interconnection of components[3]. If you consider a state machine to compute an output from the sequence of events that drove it from its initial state, you can ignore all that tiresome detail of naming the state set and figuring out the transition function and can concentrate on what the thing does.

Input symbols »»» State Machine »»» Output symbols

I’ll write w|M for the output of state machine M in the state reached by following the string of operations w from the initial state[4]. A string is just a finite sequence of symbols. I’ll write “” for the empty string, wa for the string obtained by appending symbol a to string w, and wz for the string obtained by concatenating string z to string w (if the context doesn’t make it clear whether z is a string or a symbol, I’ll try to note it explicitly). The state machines I’m using are Moore machines - state machines that produce a discrete output in each state [5]. In particular, “”|M is the output of M in the initial state. We can now define state machines recursively - most trivially, a state machine that counts inputs, “”|Count = 0 and wa|Count = w|Count + 1, and a state machine modeling a counting device that has an unspecified start state and rolls over:

wa|Counter_k = (w|Counter_k + 1) mod k [5]
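
To make the recursion concrete, here is a minimal sketch in C (my own illustration, not part of the formal development) that represents a string of events as an array and computes w|Count and w|Counter_k by folding over it; the event type, the trace, and the counter's start value are assumptions of the example.

    #include <stdio.h>

    typedef int event;   /* an event string w is modeled as an array plus a length */

    /* ""|Count = 0,  wa|Count = w|Count + 1 */
    static int count(const event *w, int len) {
        int out = 0;                         /* output in the initial state */
        for (int i = 0; i < len; i++)
            out = out + 1;                   /* one step per appended symbol */
        return out;
    }

    /* wa|Counter_k = (w|Counter_k + 1) mod k; the start state is unspecified,
       so the initial output is supplied as a parameter. */
    static int counter_k(const event *w, int len, int k, int start) {
        int out = start;
        for (int i = 0; i < len; i++)
            out = (out + 1) % k;
        return out;
    }

    int main(void) {
        event w[5] = {0};                                        /* five arbitrary events */
        printf("%d %d\n", count(w, 5), counter_k(w, 5, 3, 0));   /* prints 5 2 */
        return 0;
    }

The point is only that the output is a function of the event string; nothing about the state set needs to be named.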

Less trivially, one state machine can be specified in terms of the behavior of others on the same input string. For example, a less precise counter, call it X, can be defined by:

w|Counter_k - e <= w|X_k <= w|Counter_k + e
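
X is a specification rather than one particular machine: any device whose outputs stay within e of the exact counter satisfies it. Continuing the C illustration above, a small sketch (the observed trace, k, e, and the counter's start value are all assumptions of the example) can check a recorded run against that bound:

    /* Check that an observed output trace satisfies
       w|Counter_k - e <= w|X_k <= w|Counter_k + e  at every prefix of the run. */
    static int satisfies_spec(const int *observed, int len,
                              int k, int e, int start) {
        int exact = start;                      /* w|Counter_k */
        for (int i = 0; i < len; i++) {
            exact = (exact + 1) % k;            /* advance the exact counter */
            if (observed[i] < exact - e || observed[i] > exact + e)
                return 0;                       /* specification violated */
        }
        return 1;
    }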

 More on that later: we have enough mechanism to at least start on the first propositions. I'm going to restate the property described in the first chapter, but I'm not going to use it directly. As I tried to make this property precise, I realized it could be simplified and made more direct. 

Live1: There is some N and some E<N so that for every process P, if P is ready to run at time T, then by time T+N nanoseconds, P will have either run for at least E additional time units or P will have requested an I/O operation which caused it to suspend.
Assume we have primitive state variables Time, Runnable, and Executing [Note 6]:

wa|Time >= w|Time

w|Runnable(P) = 0 if the process is not ready to run
w|Runnable(P) = 1 otherwise

w|Executing(P) = 0 if the process is currently not executing
w|Executing(P) = 1 otherwise

Define Waiting [See note 7]:

""|Waiting(P) = 0
wa|Waiting(P) =
            (wa|Runnable(P))
            * (1 - wa|Executing(P))
            * ((wa|Time - w|Time) + w|Waiting(P))




If w|Waiting(P) > N for any P, then scheduling has failed to preserve liveness.
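
To see how this could be checked in practice, here is a small C sketch (an illustration under my own assumptions about how Time, Runnable, and Executing are sampled per event, not a piece of any real scheduler) that maintains w|Waiting(P) incrementally, exactly as defined above, and reports a liveness failure once it exceeds N:

    /* State carried from w to wa: the previous Time sample and the accumulated
       Waiting value.  Initialize with waiting = 0 and prev_time equal to
       Time's output in the initial state. */
    typedef struct {
        long long prev_time;        /* w|Time */
        long long waiting;          /* w|Waiting(P) */
    } waiting_state;

    /* One event step, given the new samples wa|Time, wa|Runnable(P), and
       wa|Executing(P).  Returns 1 if scheduling has failed to preserve
       liveness, i.e. Waiting(P) > N. */
    static int step_waiting(waiting_state *s, long long time_now,
                            int runnable, int executing, long long N) {
        long long delta = time_now - s->prev_time;     /* wa|Time - w|Time */
        s->waiting = runnable * (1 - executing) * (delta + s->waiting);
        s->prev_time = time_now;
        return s->waiting > N;
    }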
 

Traditional UNIX schedulers apply a heuristic: priority is increased in proportion to waiting time, and also boosted for processes that seem to be bound on I/O or OS requests. Let's track how long an executing process has been executing.

""|RunningTime(P) = 0,
wa|RunningTime(P) = ((wa|Time-w|Time)+ w|RunningTime(P)) * w|Executing(P)

So we can consider a process to be compute bound if it completes a time slot of, say, E time units.
""|ComputeBound(P)=0
wa|ComputeBound(P) =
               1   if w|RunningTime(P) >= E and wa|Executing(P)=0
                0  if  wa|Running(P) < E and wa|Executing(P)=0
                w|ComputeBound(P) otherwise
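
The same incremental treatment works for RunningTime and ComputeBound. The sketch below follows the definitions literally (note that w|RunningTime(P) and w|Executing(P), the values before the new event, are what the definitions refer to); the slot length E is a parameter:

    typedef struct {
        long long prev_time;        /* w|Time */
        long long running_time;     /* w|RunningTime(P) */
        int       prev_executing;   /* w|Executing(P) */
        int       compute_bound;    /* w|ComputeBound(P) */
    } exec_state;

    static void step_exec(exec_state *s, long long time_now,
                          int executing_now, long long E) {
        long long delta = time_now - s->prev_time;

        /* wa|ComputeBound(P): decided from w|RunningTime(P) when the process
           is not executing in the new state, otherwise carried forward. */
        if (executing_now == 0)
            s->compute_bound = (s->running_time >= E) ? 1 : 0;
        /* else: keep w|ComputeBound(P) */

        /* wa|RunningTime(P) = ((wa|Time - w|Time) + w|RunningTime(P)) * w|Executing(P) */
        s->running_time = (delta + s->running_time) * s->prev_executing;

        s->prev_time = time_now;
        s->prev_executing = executing_now;
    }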

Suppose we have another primitive state variable available to determine whether a process is blocked or not - it could be waiting, or dead, or not yet created:

w|Blocked(P) = 0 if the process is not waiting for the OS to provide a service
w|Blocked(P) = 1 otherwise

In practice, this will be something like (P->status).
The scheduler will increase the priority of a process in proportion to its waiting time, increase it if the process is blocked, and decrease it if the process is compute bound.

w|DynamicPriority(P) = w|BasePriority(P) + WaitScaling * w|Waiting(P)
                                         + BlockingFactor * w|Blocked(P)
                                         - HogFactor * w|ComputeBound(P)
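
Pulling the pieces together, a scheduler tick might compute the dynamic priority as in the sketch below. The scaling constants are placeholders of my own choosing, and Blocked would be read from whatever per-process status field the OS keeps (something like P->status, as noted above):

    /* Hypothetical scaling constants, for illustration only. */
    #define WAIT_SCALING     1
    #define BLOCKING_FACTOR  4
    #define HOG_FACTOR       8

    /* w|DynamicPriority(P) from the currently tracked quantities. */
    static long long dynamic_priority(long long base_priority, long long waiting,
                                      int blocked, int compute_bound) {
        return base_priority
             + WAIT_SCALING    * waiting
             + BLOCKING_FACTOR * blocked
             - HOG_FACTOR      * compute_bound;
    }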
 
Over the last 30 years, enormous effort has gone into elaborating this algorithm with different scaling factors and longer histories that may improve prediction (e.g. instead of just considering whether the process is currently blocked, we could remember the last 10 times it has stopped, or ...). One thing you can say about algorithms like this is that they are sensitive to load in ways that are hard to predict. Another thing you can say is that the basic algorithm was designed for a function that hardly exists anymore - departmental compute servers where the OS is supposed to be maximizing the utility of the processor and I/O system shared between many users with many different goals.

We can define how much time a process requires during the next t time units to avoid scheduling failure. Let H be the time required for scheduling and switching processes (the overHead).

w|Requires(P,t) = 0      if w|Waiting(P) + t < N - (E+H)
                  (E+H)  otherwise

Or we could replace E with some calculation based on past performance predicting how much of E will be consumed by P before an OS request is made. Requires is definitely an over-estimate but let's look at it for a moment.
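
As a sketch (with N, E, and H as parameters, and per-process Waiting values taken from the tracking above), Requires and the admission test discussed below might look like this:

    /* w|Requires(P,t): 0 if w|Waiting(P) + t < N - (E+H), else E+H. */
    static long long requires_time(long long waiting, long long t,
                                   long long N, long long E, long long H) {
        return (waiting + t < N - (E + H)) ? 0 : (E + H);
    }

    /* If the summed requirements over the next t time units exceed t,
       no schedule can maintain liveness; an admission controller would
       refuse new work when this happens. */
    static int overcommitted(const long long *waiting, int nprocs, long long t,
                             long long N, long long E, long long H) {
        long long total = 0;
        for (int i = 0; i < nprocs; i++)
            total += requires_time(waiting[i], t, N, E, H);
        return total > t;
    }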

If w|Requires(P,t) > t, then there is no way to keep our commitment to P.
If Sum(all P) w|Requires(P,t) > t, then we cannot maintain liveness. If the OS does not exercise some sort of admission regulation to the scheduler - checking to see whether a new process can be accommodated - then there is no way to protect against failure. And real-time constraints run directly counter to the scheduling algorithm. In an RT system, if:

w|Requires(P,E+H) = E+H and w|RunningTime(P') > SomeLargeNumber
but BasePriority(P') > BasePriority(P), then it is too bad for P. UNIX designers have traditionally tried to sidestep this issue by making the base priority of "real-time" processes high enough so that calculations of dynamic priority cannot override base priority - which means that liveness is even more fragile.


 I'm going to give scheduling a rest for a moment and discuss semantics of storage and file systems next.