What does the UNIX file system do?

Unix, Linux, Windows and other operating systems and the world wide web all support file systems with the familiar path file names  like

 "/home/snowden/travel/boardingpass.pdf"
 or "/system/passwords/secret/dontread.txt"

although sometimes with different separator characters between the individual “flat” file names. For example, Windows uses “\”. As long as we know how to separate flat file names in the sequence, it doesn’t matter. The flat file names are chained together in a path through the file system that shows “where” a file can be found. URL’s in the world wide web are just path file names with some more information around them.  Constructing the file system involves a clever technique for embedding a tree in a simpler file system where file names are just numbers.

For historical reasons, the base file system uses numbers called “inode numbers” to name files.  Ignoring modifications, this file system looks like a function F:InodeNumbers → FileData. The tree emerges from information stored in some of the files. FileData includes some files that are just data and some files that are maps called “directories” (or “folders”). Directory maps have the form d: SimpleFileNames → InodeNumbers.  If we have a path file name “a/b/c” and a starting inode number i, we can first get d1 = F(i), the contents of file i which should be a directory, then get ia= d1(a) the inode number of the file named a, and then da= F(ia) and  ib= da(b) and db= F(ib) and  ic= db(c)  and then the contents of the file “a/b/c” is F(ic ) – assuming that the path is defined.  More concisely, we can write  ia= F(i)(a) and  ib= F(ia)(b) and so on where functions are resolved left to right: for example, F(i)  is a map which is then applied to a. 

Computing the translation of a path file name to an inode number can be defined recursively in terms of a function usually called namei (for names to inode numbers).  If the path file name is the empty path, then we are already where it leads: namei(i,Empty) = i.  If the path file name is not empty, it has the form a/p  where  is a simple file name (of any length) and  is a path file name with one less simple file name in it than the original path: namei(i,a/p) = namei(F(i)(a),p) It’s possible that namei(i,p) is not defined – for example, F(i) might not even be a directory function or it might be one but d=F(i) might not be defined on the leftmost simple file name in the path. In that case, we have “file not found” or “404” in the case of a URL.

A UNIX type file name has a special inode number for the “root” directory.  For any path  file contents is then U(p)= F(namei(root,p)).  A consistent file system will have at least the following properties.

  1. No orphans. For every  in InodeNumbers,  if F(i)  is defined there must be a path  so that namei(root, p) = i. 
  2. No dangling references. For every so that F(i)  is a directory function and for ever simple file name so that F(i)(a)  is defined, F( (F(i))(a)) must also be defined (that is, if F(i)=d  and d(a)=j  it must be the case that  F(j) is defined.)  

Another useful property limits cycles or loops through the file system and aliases. If U(p)  is a directory, let Children(p) = {a: U(p)(a) is defined} where  is a variable over flat file names.  Then define find(p) = {p} if is not a directory or Children(p) = emptyset and define find(p) = union{find(pa): a in Children(p)} . If there are no loops, this is a well defined function that terminates with the set of leaf nodes reachable from p. For example if one were in an organization concerned about security, there might be regular monitoring of find(/home/snowden) to see if any unauthorized data had been collected.

The most stringent non-alias requirement would be that if namei(root,p) = namei(root,q) then p=q. There can be no loops if there are no aliases. This requirement is usually relaxed to accommodate the “parent” and “self” pseudo file names, and hard and soft links. The simple file name “.” is usually reserved to mean “self” so that if F(i)  is a directory F(i)(“.”) = i. The pseudo-file-name “..” is used for “parent” so that if F(i)(a)=j  and F(j) is also a directory, then F(j)(“..”) = i. These pseudo-file names introduce both loops and aliases so we could just limit the requirement for no aliases to the cases where and  don’t contain any pseudo-file-names. Note that the definition of the parent pseudo-file-name limits many kinds of loops because it cannot be that a directory points back at two different parents.

Soft links, a later addition to UNIX files, are a more complex problem. For soft links we add file contents that are path file names and modify namei  so that if j=F(i)(a)  is a soft link with F(j)=q, then namei(i,a/p) = namei(root,concat(q,p)). The original definition of namei has a nice property that the path shrinks by one flat name at every step and this change loses that property and makes it easy to create loops that never finish. The solution to that is to count soft links and just give up if a path takes us to more than some set limit number of soft links.