Category Archives: operating systems

The replicated state machine method of fault tolerance from the 1980s

The first time I saw this method was when I went to work for Parallel Computer Systems, later called Auragen, in the famous tech startup center of Englewood Cliffs, New Jersey. I commuted there from the East Village. (True story: I applied for the job after finding an advert in a discarded copy of the NY Times on the floor of a Brooklyn apartment while visiting friends. I sent via US mail a resume typed on a manual typewriter - I’m tempted to say “composed by the light of a tallow candle” but that would be over the top - and forgot to send the second page.)

The company built a parallel computer based on Motorola 68000s with a replicated message bus. The bus guaranteed that delivery of a message to three destinations would either succeed at all three or fail at all three. This property is called “reliable broadcast”. All interprocess communication was by message transfer (a fashionable idea at the time). Each process had a backup. Whenever a primary process sent a message, the message was also delivered to the sender’s backup and to the destination’s backup. If the primary failed, the backup could be run. The backup would have a queue of the messages received by the primary and a count of the messages sent by the primary. When the recovering backup tried to transmit a message, if the count was greater than zero, the count would be decremented and the message discarded because it had already been transmitted by the old primary. When the recovering backup did a receive operation, if there was a message on the input queue, it would get that message. In this way, the recovering backup would repeat the operations of the primary until it caught up. As an optimization, the primary could be periodically checkpointed and queues of duplicated messages could be discarded.
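The recovery rule can be sketched in a few lines of Python. This is only an illustration of the description above, not the Auragen code: the names BackupState, totals_task and recover are made up, the bus is replaced by plain Python lists, and checkpointing is left out.

```python
from collections import deque

class BackupState:
    """What the bus accumulates at the backup while the primary runs."""
    def __init__(self):
        self.received = deque()  # copies of messages delivered to the primary
        self.sent_count = 0      # number of messages the primary already sent

def totals_task(receive, send):
    """A deterministic task: read three numbers, send each running total."""
    total = 0
    for _ in range(3):
        total += receive()
        send(total)

def recover(task, backup, live_input, wire):
    """Rerun the task from its initial state on behalf of the failed primary."""
    def receive():
        if backup.received:                # replay what the primary saw
            return backup.received.popleft()
        return live_input.popleft()        # caught up: read new input
    def send(msg):
        if backup.sent_count > 0:          # the old primary already sent this
            backup.sent_count -= 1         # so count it off and discard it
        else:
            wire.append(msg)               # caught up: really transmit
    task(receive, send)

# The primary received 5 and 7 and sent two totals (5 and 12) before failing.
backup = BackupState()
backup.received.extend([5, 7])
backup.sent_count = 2
wire = []
recover(totals_task, backup, deque([10]), wire)
print(wire)  # only the message the primary never sent goes out
```

The task itself never learns that it is a recovering backup. All of the bookkeeping lives in the receive and send wrappers, which is the point of the design.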

The operating system was an implementation of UNIX. In practice, it was discovered that making each UNIX system call into a message exchange, which was an idea advocated in the OS research community at the time, caused serious performance problems. The replicated state machine operation depended on this design in order to make the state machine operation deterministic. Suppose the primary requested, for example, the time and then made a decision based on the time. A recovering backup would need exactly the same time to guarantee that it produced the same results as the primary. So every interaction between application and OS needed to be recorded in a message exchange. But a message exchange is nowhere near as fast as a system call (unless the OS developers are horrible).
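To see why the time request has to be recorded, here is a minimal Python sketch. It assumes a made-up SyscallLog wrapper that stands in for the message exchange: results handed to the primary are logged, and a recovering backup is given the logged results before it is allowed to ask the real OS.

```python
class SyscallLog:
    """Hypothetical stand-in for recording OS replies as messages."""
    def __init__(self, recorded=()):
        self.recorded = list(recorded)  # replies the old primary received
        self.cursor = 0

    def call(self, syscall):
        if self.cursor < len(self.recorded):
            result = self.recorded[self.cursor]  # replay the primary's reply
        else:
            result = syscall()                   # caught up: ask the real OS
            self.recorded.append(result)
        self.cursor += 1
        return result

def decide(log, clock):
    """A decision that depends on the time, as in the example above."""
    now = log.call(clock)
    return "even" if int(now) % 2 == 0 else "odd"

primary = SyscallLog()
first = decide(primary, lambda: 100.0)

# The backup recovers later, when the real clock has moved on.
backup = SyscallLog(primary.recorded)
replayed = decide(backup, lambda: 101.0)
print(first, replayed)  # both decisions use the recorded time
```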

The performance issue was mitigated by some clever engineering, but it was a problem that was discovered in parallel by a number of development teams working on distributed OS designs and micro-kernels, which were in vogue at the time. Execution of “ls -l” was particularly interesting.

Anyways, here’s the description from the patent.

To accomplish this object, the invention contemplates that instead of keeping the backup or secondary task exactly up to date, the backup is kept nearly up to date but is provided with all information necessary to bring itself up to the state of the primary task should there be a failure of the primary task. The inventive concept is based on the notion that if two tasks start out in identical states and are given identical input information, they will perform identically.

In particular, all inputs to a process running on a system according to the invention are provided via messages. Therefore, all messages sent to the primary task must be made available to the secondary or backup task so that upon failure of the primary task the secondary task catches up by recomputing based on the messages. In essence, then, this is accomplished by allowing every backup task to “listen in on” its primary’s message.

United States Patent 4,590,554. Glazer, et al. May 20, 1986

Inventors: Glazer; Sam D. (New York, NY), Baumbach; James (Brooklyn, NY), Borg; Anita (New York, NY), Wittels; Emanuel (Englewood Cliffs, NJ)
Assignee: Parallel Computers Systems, Inc. (Fort Lee, NJ)
Family ID: 23762790
Appl. No.: 06/443,937
Filed: November 23, 1982

See also: A message system supporting fault tolerance.

and a very similar later patent.

The Auragen file system.

This article on the interesting Wave Transactional File System inspired me to look up an earlier file system that also used copy-on-write semantics.


Anita Borg, Wolfgang Blau, Wolfgang Graetsch, Ferdinand Herrmann, and Wolfgang Oberle. 1989. Fault tolerance under UNIX. ACM Trans. Comput. Syst. 7, 1 (January 1989), 1-24. DOI=


4.3 Availability of the File System
Since a recovering file server reconstructs its buffers by reading blocks from the file system, the file system in the state as of the last sync must be available. The existence of that version of the file system is also necessary during recovery as the file server redoes requests. For example, if a file has been deleted since sync and a read request is reissued, the disk driver, and thus the recovering file server, will behave differently than the primary. Unfortunately, the contents of the disk can change between syncs, at least during the Fsync that constitutes the first phase of the sync operation.

The solution is to use a copy-on-write strategy between syncs, rather than overwriting existing blocks. Logically this corresponds to keeping two versions of a file system. An early version of the file system organization described here is discussed in Arnow [11].

There are two root nodes on disk. At any given time one of them is valid for recovery. We refer to the other as the alternate root. Associated with each root is state information (the state tables described above), the most recent being that associated with the currently valid root. Changes to the file system are done relative to a copy of the valid root kept in memory in the primary file server’s address space, and in a nondestructive manner, as seen in Figure 2(a-d). Freed blocks, which contain the old data, are added to a semi-free list, and cannot be reallocated until after the next sync. Therefore, the unmodified file system still exists rooted in the valid on-disk root node.

If a crash occurs at any time between syncs, the recovering file server is able to determine which root to use because of information sent on the primary’s last sync. It reads in the correct state information and reconstructs its buffers accordingly. Disk blocks that were used by the primary since the last sync appear to it as free blocks.

The difficult case is when a crash occurs during a sync. To see that the solution works in this case, consider the sequence of actions that take place during a sync. First, all dirty blocks except the root are written to disk, and old blocks are added to the semi-free list. Second, the state information is collected and written to the alternate state area. Third, the in-memory root is written to the alternate on disk root block. Finally, the sync message is constructed and sent to the backup. It contains the information necessary to update message queues as well as specifying which on-disk state information and root block to use on recovery.

Once the sync message has been sent, the semi-free list is added to the free list and the primary continues. Just before the sync message is sent, there are two copies of every modified data and indirect block. At any time before the sync message is sent, the old consistent state is available. Any time after it is sent, the new state and file system will be used and message queues consistently updated. An additional benefit of this organization is that the file system as a whole is considerably more robust than a standard UNIX-style file system. Even if the entire system is shut down in an uncontrolled way as the result of multiple faults or operator error, there will always be an entire consistent file system on disk.
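A toy model helps me keep the two roots straight. The Python sketch below is my own simplification of the scheme quoted above, not the Auragen implementation: CowFileSystem and its fields are invented names, a root is just a map from file name to block number, the state tables and indirect blocks are ignored, and flipping the valid index stands in for sending the sync message to the backup.

```python
class CowFileSystem:
    def __init__(self, nblocks=8):
        self.disk = {}                   # block number -> data
        self.roots = [{}, {}]            # the two on-disk roots
        self.valid = 0                   # root the backup was told to use
        self.mem_root = {}               # in-memory copy of the valid root
        self.free = list(range(nblocks))
        self.semi_free = []              # old blocks, not reusable until sync

    def write(self, name, data):
        new = self.free.pop(0)           # never overwrite a block in place
        self.disk[new] = data
        old = self.mem_root.get(name)
        if old is not None:
            self.semi_free.append(old)   # old data stays readable on disk
        self.mem_root[name] = new

    def sync(self):
        alternate = 1 - self.valid
        self.roots[alternate] = dict(self.mem_root)  # write the alternate root
        self.valid = alternate           # "sync message": switch roots
        self.free.extend(self.semi_free) # only now can old blocks be reused
        self.semi_free.clear()

    def recover(self):
        """File contents a recovering file server sees after a crash."""
        root = self.roots[self.valid]
        return {name: self.disk[block] for name, block in root.items()}

fs = CowFileSystem()
fs.write("a", b"v1")
fs.sync()
fs.write("a", b"v2")       # crash here and recovery still sees v1
print(fs.recover())
```

Rebuilding the free list on recovery is left out. In the real design, blocks used since the last sync simply appear free to the recovering server.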

What does the UNIX file system do?

Unix, Linux, Windows and other operating systems and the world wide web all support file systems with the familiar path file names like

 "/system/passwords/secret/dontread.txt"

although sometimes with different separator characters between the individual “flat” file names. For example, Windows uses “\”. As long as we know how to separate flat file names in the sequence, it doesn’t matter. The flat file names are chained together in a path through the file system that shows “where” a file can be found. URLs in the world wide web are just path file names with some more information around them. Constructing the file system involves a clever technique for embedding a tree in a simpler file system where file names are just numbers.

For historical reasons, the base file system uses numbers called “inode numbers” to name files. Ignoring modifications, this file system looks like a function F: InodeNumbers → FileData. The tree emerges from information stored in some of the files. FileData includes some files that are just data and some files that are maps called “directories” (or “folders”). Directory maps have the form d: SimpleFileNames → InodeNumbers. If we have a path file name “a/b/c” and a starting inode number i, we can first get d_1 = F(i), the contents of file i which should be a directory, then get i_a = d_1(a), the inode number of the file named a, and then d_a = F(i_a) and i_b = d_a(b) and d_b = F(i_b) and i_c = d_b(c), and then the contents of the file “a/b/c” is F(i_c), assuming that the path is defined. More concisely, we can write i_a = F(i)(a) and i_b = F(i_a)(b) and so on, where functions are resolved left to right: for example, F(i) is a map which is then applied to a.

Computing the translation of a path file name to an inode number can be defined recursively in terms of a function usually called namei (for names to inode numbers). If the path file name is the empty path, then we are already where it leads: namei(i,Empty) = i. If the path file name is not empty, it has the form a/p where a is a simple file name (of any length) and p is a path file name with one less simple file name in it than the original path: namei(i,a/p) = namei(F(i)(a),p). It’s possible that namei(i,p) is not defined. For example, F(i) might not even be a directory function, or it might be one but d=F(i) might not be defined on the leftmost simple file name in the path. In that case, we have “file not found” or “404” in the case of a URL.
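The recursive definition translates almost directly into code. In this Python sketch the representation is my assumption, not anything from a real kernel: F is a dict from inode numbers to file contents, a directory is a dict from simple file names to inode numbers, plain data is bytes, a path is a list of simple names, and None stands for “not defined”.

```python
def namei(F, i, path):
    """Translate a path (list of simple file names) to an inode number."""
    if not path:                      # namei(i, Empty) = i
        return i
    a, p = path[0], path[1:]          # the path has the form a/p
    d = F.get(i)
    if not isinstance(d, dict) or a not in d:
        return None                   # "file not found"
    return namei(F, d[a], p)          # namei(i, a/p) = namei(F(i)(a), p)

# A tiny file system: inode 0 is the starting directory.
F = {
    0: {"a": 1},
    1: {"b": 2},
    2: {"c": 3},
    3: b"hello",
}
ic = namei(F, 0, "a/b/c".split("/"))
print(ic, F[ic])
```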

A UNIX-type file system has a special inode number for the “root” directory. For any path p, the file contents are then U(p) = F(namei(root,p)). A consistent file system will have at least the following properties.

  1. No orphans. For every i in InodeNumbers, if F(i) is defined there must be a path p so that namei(root, p) = i.
  2. No dangling references. For every i so that F(i) is a directory function and for every simple file name a so that F(i)(a) is defined, F(F(i)(a)) must also be defined (that is, if F(i)=d and d(a)=j it must be the case that F(j) is defined).
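Both properties are easy to check mechanically. The Python sketch below assumes the same made-up representation as before (F is a dict from inode numbers to contents, with a directory being a dict from names to inode numbers). It uses the fact that some path p with namei(root, p) = i exists exactly when i can be reached by walking directory entries from the root.

```python
def reachable(F, root):
    """All defined inodes that some path from root leads to."""
    seen, stack = set(), [root]
    while stack:
        i = stack.pop()
        if i in seen or i not in F:
            continue
        seen.add(i)
        if isinstance(F[i], dict):        # a directory: follow its entries
            stack.extend(F[i].values())
    return seen

def orphans(F, root):
    """Inodes that violate property 1."""
    return set(F) - reachable(F, root)

def dangling(F):
    """(directory inode, name) pairs that violate property 2."""
    return {(i, a)
            for i, d in F.items() if isinstance(d, dict)
            for a, j in d.items() if j not in F}

F = {
    0: {"etc": 1, "gone": 9},   # inode 9 is not defined anywhere
    1: {"passwd": 2},
    2: b"root:x:0:0",
    7: b"nobody points at me",
}
print(orphans(F, 0), dangling(F))
```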

Another useful property limits cycles or loops through the file system and aliases. If U(p) is a directory, let Children(p) = {a: U(p)(a) is defined} where a is a variable over flat file names. Then define find(p) = {p} if U(p) is not a directory or Children(p) = emptyset, and define find(p) = union{find(p/a): a in Children(p)} otherwise. If there are no loops, this is a well defined function that terminates with the set of leaf nodes reachable from p. For example, if one were in an organization concerned about security, there might be regular monitoring of find(/home/snowden) to see if any unauthorized data had been collected.
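Here is find as a Python sketch under the same assumed representation (F a dict from inode numbers to contents, directories as dicts, paths as tuples of flat names). It assumes there are no loops and no “.” or “..” entries; with either it would recurse forever, which is exactly why the no-loop property matters.

```python
def U(F, root, path):
    """Contents of the file named by path, or None if the path is undefined."""
    i = root
    for a in path:
        d = F.get(i)
        if not isinstance(d, dict) or a not in d:
            return None
        i = d[a]
    return F.get(i)

def find(F, root, path=()):
    """Set of leaf paths reachable from path (plain files or empty directories)."""
    node = U(F, root, path)
    if not isinstance(node, dict) or not node:   # not a directory, or no children
        return {path}
    leaves = set()
    for a in node:                               # a ranges over Children(path)
        leaves |= find(F, root, path + (a,))
    return leaves

F = {
    0: {"home": 1},
    1: {"snowden": 2},
    2: {"notes.txt": 3, "empty": 4},
    3: b"x",
    4: {},
}
for leaf in sorted(find(F, 0, ("home", "snowden"))):
    print("/" + "/".join(leaf))
```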

The most stringent non-alias requirement would be that if namei(root,p) = namei(root,q) then p=q. There can be no loops if there are no aliases. This requirement is usually relaxed to accommodate the “parent” and “self” pseudo file names, and hard and soft links. The simple file name “.” is usually reserved to mean “self” so that if F(i) is a directory, F(i)(“.”) = i. The pseudo-file-name “..” is used for “parent” so that if F(i)(a)=j and F(j) is also a directory, then F(j)(“..”) = i. These pseudo-file-names introduce both loops and aliases, so we could just limit the requirement for no aliases to the cases where p and q don’t contain any pseudo-file-names. Note that the definition of the parent pseudo-file-name limits many kinds of loops because it cannot be that a directory points back at two different parents.

Soft links, a later addition to UNIX files, are a more complex problem. For soft links we add file contents that are path file names and modify namei so that if j=F(i)(a) is a soft link with F(j)=q, then namei(i,a/p) = namei(root,concat(q,p)). The original definition of namei has the nice property that the path shrinks by one flat name at every step. This change loses that property and makes it easy to create loops that never finish. The solution is to count soft links and just give up if a path takes us through more than some set limit of soft links.
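A sketch of the modified namei, again in Python and again over an assumed representation: F is a dict from inode numbers to contents, directories are dicts, and a soft link is a made-up SoftLink record holding a path from the root. The limit MAX_LINKS = 8 is an arbitrary number picked for the example; real systems choose their own.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SoftLink:
    target: tuple            # q: a path of flat names, taken from the root

MAX_LINKS = 8                # arbitrary limit for this sketch

def namei(F, root, i, path, links=0):
    if not path:
        return i
    a, p = path[0], tuple(path[1:])
    d = F.get(i)
    if not isinstance(d, dict) or a not in d:
        return None                              # file not found
    j = d[a]
    if isinstance(F.get(j), SoftLink):
        if links >= MAX_LINKS:
            return None                          # too many soft links: give up
        q = F[j].target
        return namei(F, root, root, q + p, links + 1)   # restart at root on q/p
    return namei(F, root, j, p, links)

F = {
    0: {"real": 1, "alias": 2, "loop": 3},
    1: {"f": 4},
    2: SoftLink(("real",)),      # /alias -> /real
    3: SoftLink(("loop",)),      # /loop -> /loop, never resolves
    4: b"data",
}
print(namei(F, 0, 0, ("alias", "f")), namei(F, 0, 0, ("loop", "f")))
```

Without the counter, the lookup of “loop/f” would recurse until the interpreter ran out of stack, because the path no longer shrinks at every step.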



A claimed validated operating system.

The claim: we have demonstrated the comprehensive formal verification of the seL4 microkernel, with a complete proof chain from precise, formal statements of high-level security and safety properties to the  binary executable code. GD

The L4 base is useful –  we advocated a similar approach with RTLinux which was, um, very similar to L4. It looks like the L4 version here uses the interrupt emulation method at the heart of RTLinux (of course, without any attribution or reference). Just for the record, here’s a much earlier RTLinux based effort.

As for the verification, I am highly skeptical of the claim. Here’s the claimed proof and the research paper. It’s not at all clear exactly what was validated, but from the paper it looks like the 8000-odd lines of the L4 microkernel were shown to provide the functionality described in the Hoare logic specification. No device drivers appear to have been validated.

Real-time Linux

My opinion has always been that the Linux-RT project was based on an unfixable engineering error.


A few words on the status and the future of RT:

The situation since last years RTLWS
has not improved at all, it's worse than before.

While shortly after RTLWS quite some people promised to whip up proper
funding, nothing has materialized and my personal situation is worse
than before.

I'm really tired of all the politics involved, the blantant lies and
the marketing bullshit which I have to bear. I learned a few month ago
that a certain kernel vendor invented most of RT anyway and is the
expert in this field, so the customers dont have to worry about my

Just for the record: The initial preempt-RT technology was brought to
you mostly by Ingo Molnar, Steven Rostedt, Paul Mckenney, Peter
Zijlstra and myself with lots of input from Doug Niehaus, who
researched full in kernel preemption already in the 1990s. The
technology rewrite around 3.0-rt was done by me with help from Peter
and Steven, and that's what preempt-RT today is based on.

Sure, people can believe whatever marketing bullshit they want, but
that doesn't make the truth go away. And the truth is, that those who
claim expertise are just a lying bunch of leeches.

What really set me off was the recent blunt question, when I'm going
to quit. What does this mean? Is someone out there just waiting that I
step down as preempt-RT maintainer, so some corporate entity can step
up as the saviour of the Linux RT world? So instead of merily leeching
someone seeks active control over the project. Nice try.

Free Software Patents: Alan Cox

1. A method, comprising: storing, by one or more processors, confidential data in a confidential section of virtual memory, wherein storing the confidential data in the confidential section of virtual memory comprises: mapping the confidential section of virtual memory to an address space in a first physical memory device; storing the confidential data in the first physical memory device; and marking the address space in the first physical memory device as having confidential data; receiving a request to copy data stored in the address space in the first physical memory device to a second physical memory device, wherein the second physical memory device has more capacity and slower memory access speed than the first physical memory device; determining that the address space in the first physical memory device has been marked as having confidential data; and denying the request to copy in response to determining that the address space in the first physical memory device has been marked as having confidential data.

2. The method of claim 1, wherein the request to copy data stored in the address space in the first physical memory device is received as a result of a power-saving operation.

3. The method of claim 2, the operations further comprising: copying data stored in non-confidential sections of the virtual memory to the second physical device; completing the power-saving operation; and upon resuming from the power-saving operation: determining one or more processes had been using the confidential data; and providing a warning to the one or more processes that the confidential data was not copied to the second physical memory device.

4. The method of claim 2, the operations further comprising: copying data stored in non-confidential sections of the virtual memory to the second physical device; completing the power-saving operation; and upon resuming from the power-saving operation: determining one or more processes had been using the confidential data; and terminating the one or more processes.

Inventors: Van Riel; Henri Han (Nashua, NH), Cox; Alan (Surrey Research Park, GB)
Assignee: Red Hat, Inc.


[0008] In accordance with one embodiment of the invention, a method of protecting confidential data is provided. When a request to allocate space in a virtual memory for confidential data is received, a portion of the virtual memory is marked as confidential. It is determined if a portion of a physical memory has been assigned for the confidential portion of the virtual memory. The portion of the physical memory that has been assigned for the confidential portion of the virtual memory is then marked as having confidential data.

[0009] In accordance with another embodiment of the invention, a method of protecting data allocated to a confidential area of virtual memory that is stored in physical memory is provided. When contents of the physical memory are being written to another location, contents of the physical memory that correspond to data allocated to the confidential area of the virtual memory are identified. The identified contents of the physical memory are then protected.

[0010] Additional embodiments of the present invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.

And then

1. A method for reducing the number of calls from an operating system to an application program, comprising the steps of: associating in the operating system at least one indicia with a first request to access hardware, the indicia indicating a type of notification to be provided by the operating system to the application program upon completion of the first request; receiving from the application program a second request; and based on the second request, de-associating one or more of the at least one indicia from the first request so that notification no longer needs to be provided by the operating system to the application program upon completion of the first request.

2. The method according to claim 1, wherein the notification comprises an operating system call.

3. The method according to claim 1, wherein the first request and the second request comprise input-output requests received from the application program.

4. The method according to claim 1, wherein the first request and the second request comprise a linked list.

5. The method according to claim 1, wherein the first request and the second request comprise a table.

6. The method according to claim 1, wherein the indicia comprises a flag.

Inventors: Cox; Alan (Swansea, GB)
Assignee: Red Hat, Inc.

[0011] In one embodiment of the present invention, a task can be added to the kernel input/output (I/O) queue while that queue of asynchronous I/O is being processed. The kernel can provide or set indicia, such as a flag, that is readable, for the example, by the application program. The flag can indicate whether or not the kernel is processing any I/O for a particular process (task). For example, while the I/O queue is being processed, the operating system kernel can receive, from an application program can, pertinent data (such as, for example, the file being written to, the data that is to be written to a file, and whether the application is to be notified upon completion of the write operation). The request is written atomically to the kernel I/O queue. When the process has a next kernel I/O request, the process examines the flag to determine if the kernel has completed I/O for the process. If the flag indicates that the I/O queue is completed for the process, the kernel receives a system call. If the flag indicates that the I/O queue is not completed, then the application program need not make a system call. When the I/O is completed, the kernel can check for race conditions. If another request is present in the I/O queue due to a race condition, the kernel can dispatch the request by using a kernel interrupt handler, rather than waiting for the application program to issue a system call to the kernel.

The Multics file system

The design proposed in this paper is ubiquitous.

A file is simply an ordered sequence of elements, where an element could be a machine word, a character, or a bit, depending upon the implementation. A user may create, modify or delete files only through the use of the file system. At the level of the file system, a file is formatless. All formatting is done by higher-level modules or by user-supplied programs, if desired. As far as a particular user is concerned, a file has one name, and that name is symbolic. (Symbolic names may be arbitrarily long, and may have syntax of their own. For example, they may consist of several parts, some of which are relevant to the nature of the file, e.g., ALPHA FAP DEBUG.) The user may reference an element in the file by specifying the symbolic file name and the linear index of the element within the file. By using higher-level modules, a user may also be able to reference suitably defined sequences of elements directly by context.

A directory is a special file which is maintained by the file system, and which contains a list of entries. To a user, an entry appears to be a file and is accessed in terms of its symbolic entry name, which is the user’s file name. An entry name need be unique only within the directory in which it occurs. In reality, each entry is a pointer of one of two kinds. The entry may point directly to a file (which may itself be a directory) which is stored in secondary storage, or else it may point to another entry in the same or another directory. An entry which points directly to a file is called a branch, while an entry which points to another directory entry is called a link. Except for a pathological case mentioned below, a link always eventually points to a branch (although possibly via a chain of links to the branch), and thence to a file. Thus the link and the branch both effectively point to the file. (In general, a user will usually not need to know whether a given entry is a branch or a link, but he easily may find out.)



Dutch masters

Seen on Linux Weekly News.

Ext4 maintainer Ted Ts’o has responded with a rare (for the kernel community) admission that technical concerns are not the sole driver of feature-merging decisions:

It’s something I do worry about; and I do share your concern. At the same time, the reality is that we are a little like the Old Dutch Masters, who had to take into account the preference of their patrons (i.e., in our case, those who pay our paychecks :-).

One of those rare moments when art, commerce, and engineering collide to produce comedy.

Why microkernels don’t work

You can almost just see it from this diagram of connected boxes. I want to think of the whole system as a series of connected state machines. The arrows show how information is moved around the system, with the green arrows identifying paths that carry data to and from the memory. When you fill in the details, you start to see that the proposed shearing off of the “fileserver” from the remaining “kernel” does not actually split state as much as it reproduces it. So much of the state of the rump Kernel needs to be available to the FileServer that the proposed modularity disintegrates.

The counterargument, in its best form, can be found at QNX.