informal methods applied to networks and timeouts

Distributed computation involves many interesting issues concerning shared data. Here I want to sketch out what networks look like using applied (informal) mathematics so we can look at some algorithms for consensus and data consistency. No  formal methods or category theory or metamathematics of any kind have been used in making this note.

Start with a set M of messages and a set S of “sites” or site identifiers. We have a map  sources:M → S so that each message m is associated with the site that sent or can send it. We have two interesting subsets of M, Requests ⊂ M and Responses ⊂ M with Requests ∩ Responses = emptyset.  The idea is that some sites will send requests to other sites and may get responses back. Each response should be unambiguously associated with a request so there is a map req:Responses → Requests where, for m in Responses,  req(m)  is the corresponding element of Requests .

A network can be described as a collection of connected state machines that represent network sites or nodes. In this work, state machines are represented by maps f:E → X where E is an alphabet of events, E is the set of finite sequences over E, and f(σ) is the output in the state reached by following sequence σ from the initial state.  Let Null be the empty sequence so f(Null) is the initial state output. Write  (σ after a) for the sequence obtained by appending element “a” on the right of sequence  σ. The intuition is that  σ takes the system to the current state state and then an “a” event takes it to the state “after a”. For example, f(σ after a) > f(σ) means “a” caused the output to increase in value.

Sites can receive  messages, transmit  messages, and advance state just by the passage of a unit of time. Let  the set of events E include all the elements of M, each of which represents receiving that message, plus an event “transmit” which means – “transmit the message the site is attempting to send” – and “tick” which means some unit of time passed.    Here is a definition of a function to count how many times the site received any particular message.

  • RxCount(Null,m) = 0,
  • RxCount( σ after m, a) = RxCount( σ, m)+1 if  a=m
  • and  RxCount( σ after a, m) = RxCount( σ, m) otherwise.

We can also count the number of ticks since the first time a message was received.

  • TickCount(Null,m)=0,
  • TickCount( σ after a, m) = TickCount( σ,m)+1 if (a=tick AND Received( σ,m) >0),
  • TickCount( σ after a, m ) = TickCount( σ,m) otherwise

 

Suppose we have some state function tx so that tx(σ) =m when the site is attempting to transmit message m and tx(σ) =nil if the site is not attempting to send a message in the current state.  A transmit event causes  tx(σ) to be transmitted if tx(σ) is not nil.  Using tx as a parameter to a counting function, we get:

  • TxCount(Null,tx,m) = 0,
  • TxCount(σ after a,tx,m ) = 1+ TxCount(σ,tx,m) if a=transmit and m=tx(σ ),
  • TxCount(σ after a,tx,m ) = TxCount(σ,tx,m) otherwise.

We can also count how much time has passed since we first tried to transmit a message m.

  • TriedCount( Null, tx m) = 0
  • TriedCount( σ after a, tx m) = 1 if ( tx( σ) =m and TriedCount(σ, tx,m)=0
  • TriedCount( σ after a, tx, m) = TriedCount( σ,tx,m)+1 if (a=tick AND TriedCount( σ,tx, m) >0),

A “site”  s will be represented by two state dependent maps tx and g, with tx tracking transmits and g telling us whether the site has failed. We don’t specify what makes a site fail, only that it can fail and once failed, it never recovers (we can make this more smart, by specifying recovery actions later, but for now let failure be permanent).

(6) Let s be an element of our set of Sites and say (tx,g) represents site  s iff

  1.  tx(σ)in M or tx( σ) =nil
  2.  tx(σ)  in M implies source(tx( σ))=s
  3.  tx(σ) in Responses implies RxCount( σ, s,req(tx(σ))) > 0
  4.  g(σ) in {True, False} and  g(σ) implies ( tx(σ) = nil AND g(σ after a))
  5. EXISTS c> 0 so that for m in Requests where TickCount( σ, s,m) > 0 , for some r in Responses TriedCount( σ,tx,r)>0

The last requirement means that sites at lest have to try to send responses.

Suppose we have a set of such pairs (txs,gs) for s in S. The next thing to do is to determine a sequence of events σ for each s in S because each site may “see”  a different sequence of events. That is, we need to specify a mapping from sequences σ that determine the state of the composite system (the network) and sequences σdetermining the state of each component site s.

The network that contains all these sites can be itself represented as a state machine with its own event alphabet A and event sequences  σ over A.  At this point, we don’t have to know much about A, just about what happens to each σwhen an event “a” is appended to σ  over A.  Let σ = h(σ, s) and let h(Null,s) = Null so all sites start in the initial state. Then h(σ after a,  s) needs to be specified. For this system, each component can either advance by 0 or one step so

  • h(σ after a,  s)= h(σ,  s) or h(σ after a,  s)= (h(σ ,  s) after b) for some b in E.

The expression (h(σ ,  s) after b ) just means that b is appended on the right to the sequence h(σ ,  s). Now we need some rules for how the state changes in the sites relate to each other.  First, if time advances for one site, it has to advance for all of them:

  • h(σ  after a,  s) = (h(σ ,  s) after tick) implies that for all s’ in S,  h(σ  after a,  s’) = (h(σ ,  s’) after tick)

A message can be delivered to a site only if that message is sent by another site

  • h(σ  after a,  s) = (h(σ ,  s) after m) implies that h(σ  after a,  source(m)) = (h(σ ,  source(m)) after transmit) AND  txsource(m) (h(σ ,source(m))) = m

Notice that we don’t have to limit how many sites can transmit and receive or whether messages are broadcast or not. We could, alternatively require that h(σ  after a,  s) = (h(σ ,  s) after m) implies TxCount(h(σ after a ,txs,source(m)) > 0 to allow the network to deliver messages after a delay or out of order, but let’s start with something simpler.

Let Sent(σ,s,m)  be the number of times site s has sent message m and Recvd(s,m)  be the number of times site s has received message M.  A site s can be operating or failed  so let Failed(s) indicate that status.  Note σ in this context is the event sequence of the system (the network) and each   σs is the induced event sequence of site s.

  • Rcvd(σ, s,m) = RxCount(σs ,m)
  • Sent(σ, s,m) =  TxCount(σs,txs,m)
  • Failed(σ, s) =    gss )
  • Since(σ, s,m) =    TickCount(σ,m)
  • SinceTried(σ, s,m) =    TriedCount(σ,m)

The network is loosely specified but has the properties that  (1) a site is either failed or not, (2) no message can be received unless it has been sent, (3) a site can only send a message if it is the source and (4) a response won’t be sent unless the corresponding request has been received by that site.

  1. Failed(σ,s) or not Failed(σ,s) . PROOF 6.4 and definition above.
  2. Recvd(σ,s,m) > 0 implies Sent(σ,source(m),m) > 0.  PROOF by definition and 6.3
  3. Sent(σ,s,m) > 0 implies source(m) = s. PROOF from the definition and 6.2
  4. Sent(σ,s,r)>0  for r  in Responses implies Recvd(σ,s,req(r))  PROOF from the definition and induction on sequence lengths.

For the network as specified, there is no boolean state function F and sites s and z so that F(σs) implies Failed(σ,z).That is, there is no way for any site to deduce the failed state of any other site.  Suppose there was such an F. Then for some σ we have F(σ) and g(σ). Let σ =concat(u,v) where u and v are sequences and u is the shortest prefix of σ  so that Failed(σ,z). There must be some such u or else Failed(σ,z) would be false. We can define a new (tx,g) that acts just like (txz,gz)  until after σ except that g never becomes true. Define g(w) = False for all sequences w and define tx(w)=  txz(w) whenever w is a prefix of σz (or equal to σz  ). If w has σ as a proper prefix then set tx(w)=m for some arbitrary message m. Then replace (txz,gz) with (tx,g). In the new system F(σs ) is true but g(σz) is false and not only that, site z can send messages after site s has concluded that it has failed. Therefore no such F is possible.. This result is often called the Fischer-Lynch-Patterson Theorem.  To me, this result means the obvious: that without timeouts and some sort of message delivery probability, the network can’t guarantee anything much.

Now let’s informally state a property of the system that sites are obligated to try to respond to requests unless they fail.

  • .Recvd(σ,s,q) for q in Requests implies eventually either  Sent(σ,s,r) for some r so that q=req(r)  or  Failed(σ,s)

So what does “eventually” mean? Here is a property of sites.

  • 6.5.  EXISTS c>0 s.t. TickCount( σ, r) > c for r in Requests implies that (for some q, so that req(q)=r,  either (TxCount( σ, tx,q)>0) OR g(σ))

A site cannot force a response message to be transmitted, but it can try to transmit the response message and fail if its transmit request is ignored by the network for too long. Then we have a network property

  •  Recvd(σ,s,q) for q in Requests implies  ( Sent(σ,s,r) for some r so that q=req(r)  OR   Failed(σ,s) OR  Since(σs,s,q) < c ) where c is the bound from 6.5