Specifying the UNIX file system

The UNIX file system design is one of the most brilliant constructions in programming with a combination of conceptual simplicity, utility, and suitability for implementation that is hard to match. (Note: this last quality is something that is prized in all engineering disciplines, but we don’t even have a standard term for it in programming.) What I’m going to do here is specify the operation of the file system, separating it from other parts of the operating system: in particular the I/O subsystem and the process management system. In the second section, I will show how to snap-together specificationsi to get a start on composition. I think I may have picked to complex an example for that section, but take a look.

Posix file system as a function on inodes

The basic plan here is to first spend some time representing a file system contents and only then worry about how it changes state.

The basic operation of a UNIX file system is association of inode numbers with files. We can think of the file system, in some state, as a map FileSystem: InodeNumbers &arrow File. One of the smart ideas in UNIX was “everything is a file.” In fact there are three file types in a classical UNIX file system

  1. Data files – just sequences of bytes (strings).
  2. Directories which contain pairs (inode number, name) associating names with inodes
  3. Device files which contain some data that allows the operating system to find a drivers from the file.

There are also permission descriptors but we can get to that later.

A UNIX path is a string that includes “/” characters as delimitors so that “n1/n2/n3 … nN” means “in the current directory find pair (n1,i1) and inode i1 should be associated with a second directory which should contain a pair (n2,i2) and so on down to some file nN. T The namei operation translates a path to an inode. Let’s make the representations more complete.

A POSIX file is either

  1. The integer value -1 representing – does not exist.
  2. A string representing a data file. Basic operations on a data file are writing new data (inserting)
    and truncating the file. Let insert(q,n,z) insert string z in string
    q after position n and let truncate(q,n) produce the string obtained by dropping
    all characters after position n. Then the POSIX file operation to write character r after
    a “seek” position in the file causes a change from q to insert(q,seek,r) and the
    truncate operation – well, truncates the file.
  3. A non-negative integer representing a device file. In classical UNIX this is the pair of major and
    minor number, but here we will be more general.
  4. A list ((i,.),(j,..),[0 or more pairs (k,q)]) representing a directory so that
    inode number i has a list ((i,.),(j,..) … ) where j is the inode number of
    the parent of i. Let match(List,q) find the first pair (i,q)
    in the list that matches name q (because we don’t know if there is more
    than one match) and let it have the value -1 if there is no match. Then the
    parent of List is match(List,”..”). An empty directory looks like
    ((i,..),(j,.)). Let addpair(List,(k,q)) just add the pair (k,q) to
    the List and let delpair(List,k) take the first pair (k,q) that matches out of the List

So a snapshot of file system state can be given, at any moment by a function mapping inode numbers to files. The hard part, sort of, is showing how those functions change in response to operations on the filesystem. Operations which change the state of the contents are create, write, truncate, unlink, and link. Although these operations are mostly operations on file descriptors at the user level, they are operations on inode numbers inside the file system. To make some sense of it all, let namei convert a path name into an
inode number as follows:

If f:InodeNumbers &arrow; PosixFiles is a function, then

GoodDirectory(f,j) =
	1	if f(j) is a list that starts with (j,.). (i,..) and either nothing else or
			pairs (k,q) so that k != i and k !=j and so that if (k,p) is in
			the directory no (k',p) is in the directory for k' != k.
	0	otherwise

Parent(f,i) =
	j if (j,..) is in f(i) and GoodDirectory(f,i)
	-1 otherwise

namei(f,j, path) =
	{k: (k,path) is in f(j)}  if Goodirectory(f,j) and path contains no / symbols
	{namei(f,k,q): (k,p) in f(j)}	if GoodDirectory(f,j) and path=p/q where p contains no / symbols
	namei(f,1,p) if ppath=/p
	{} otherwise

The iname function reverses the lookup - it is impractical to implement, but useful conceptually.
iname(f,j) = {p : namei(f,1,p)=i}

f is consistent if and only if for every i in I,
if f(i) &gte; 0 then iname(f,i) != emptyset.
and if f(i) is a list than GoodDirectory(f,i) = 1.

Notice that until this point, I’ve not used anything past normal sets/functions/relations. Now I want
to be able to describe how an operation transforms some file system map f into f’. I’m going to use low level versions of the operations on files here – I’ll show how to relate these to high level POSIX user API later. The key operations are

  1. create(parent,new,name,type) where type is Directory, or Data, or
    a Device identifier – I’m assuming that Device identifiers are distinct from the other two types.
  2. remove(i,j) to remove the first (j,q) pair from the directory at inode i
  3. free(i) to delete a file of any kind
  4. write(i,q,n) to insert the data in string q into the file after position n

We’ll say |X implements a POSIX filesystem contents on set of inode numbers I
if and only if:

wa|X(i) = 	""	if a=create(j,i,q,data)
		((i,.),(j,..))	if a=create(j,i,q,directory)
		D	if a=create(j,i,q,Dev)
		addpair((j,q),w|X(i))	if a=create(i,j,q,...)
		delpair(j,w|X(i))	if a=remove(i,j)
		-1		if a=free(i)
                insert(q,n,w|X(i))	if  a=write(i,q,n)
		truncate(n,w|X(i))	if a=truncate(i,n)
		w|X(i)			otherwise

Nothing in this specification prevents destruction of file system consistency – for example by turning a directory into a data file and stranding the files beneath it. We add that below.

Adding permissions and link counts

Let’s add something for permissions. This takes us into the fuzzy boundary between other parts of the OS and the file system.

The UNIX designers separated the management of file descriptors from inodes quite cleanly. Each process has open file descriptors that the OS associates with inode numbers – file descriptors are process specific. To really model the operation of the file system in the operating system, we need to model both concurrent operations and asynchronous operations – as requests stall waiting for blocks. In fact, we need to relate the file system to the process model sketched in the first section. Before getting into all that, however, I’m going to spec out a much simpler system consisting of synchronous immediate operations on inodes, but adding permission bits.

For this state machine we need some new operations and new parameters to create. How to relate this state machine to the file system contents state machine and how to work with two slightly different alphabets is below.

|Y is a state machine implementing permissions on a POSIX file system with inodes I if:
wa|Y(i) =
	N	if a=chmod(i,N)
	w|Y(i)	otherwise.

|R is a state machine recording return value if and only if

wa|R() = a

M is a (very high level ) memory if and only if
wa|M(c) =
	z 	if a=copy(z,c)
	w|M(c)	otherwise

So how do we relate the |X ,|Y, |R and |M state machines? Well, we can define a file system implementation |F and define a relationship between events for |F and for “encapsulated” copies of the other machines. The value of w|F will depend partially on x(w)|X where “x” transforms high level events for the file system into events on the inode map so that, for example, if some create is not permitted due to permissions, a create event will be filtered out before it gets to |X. Instead of defining functions directly by recursion on how input changes output, e.g. wa|X = f(a,w|X), we define functions indirectly by recursively defining how the input changes – e.g w|Z = x(wa)|X and x(wa) = f(a,x(w))

The return value of the last operation will be given by r(w)|R
The results of the last read to address c are in m(w)|M(c)
The data to be written for a write from address c is in m(w)|M(c)

Permissions(F,i) = y(w)| Y(i)
Contents(F,i) = x(w)| X(i)
ReadValue(F,c) = m(w) | M(c)
w|PickInode =
	some i	s.t. x(w)|X(i) = -1
	or -1	if there is no such i

w|Permitted(a) =
	1	if	a=create(name,type,permissions [,data])
			and 	either name contains no / symbols
					and there is no (j,name)  in x(w)|X(j)
				or name = p/q where q contains no / symbols
					and namei(x(w)|X,/p) contains j
			and GoodDirectory(x(w)|X,j)
			and y(w)|Y(j) has write permission on
		or if	a=delete(j) and y(w)|Y(j) has write permission
			and either	x(w)|X(i) is a file or node or a directory with only . and ..
		or if	a=write(j,c,n)
			and y(w)|Y has write permissions
			and x(w)|X(j) is a data file containing a string of at least n elements
			and m(w)|M(c) is not empty
		or if	a=read(j,c,n,s)
			and y(w)|Y has read permissions
			and x(w)|X(j) is a data file containing a string of at least s+n elements
		or if 	a=chmod(i,perm) and x(w)|X(i) != -1
Where:
y("")=y("")=x("")=m("") = "" - all component state machines start in the initial state
y(wa) =
	y(w)chmod(i,perm)	if w|Permitted(a)
				and a=chmod(i,perm)
				or a=create(path,type,perm [,device])
	y(w)			otherwise

r(wa)	=
	r(w)-1			if w|Permitted(a) = 0
	r(w)1			otherwise

m(wa) =
	m(w)copy(c,q)		if w|Permitted(a) and
				a=read(c,i,s,n)
				and q is the segment of n bytes starting at position s in x(w)|X(i)

				or if a=copy(c,q)

	m(w)			otherwise

x(wa)	=
	x(w)create(j,i,p,v) 	if w|Permitted(a) and a=create(name,type,perm,data)
				and name=p/q where q contains no / symbols
				and namei(w|X,p) contains j and only j
				and GoodDirectory(w|X,j)
				and y(w)Y(j) has write and create permissions set
				and there is no (k,q) in x(w)|X(j)
				and j = w|PickInode()
	x(w)write(i,v,s)	if w|Permitted(a) and a=write(i,c,n,s)
				and v = the n element sequence starting at m(w)|M(c)
	x(w)remove(i,j)		if w|Permitted(a) and a=delete(i) and