I’m thinking about distributed consensus algorithms, timestamping, and databases, and if you read that literature you will see many references to the Fischer, Lynch, Paterson “theorem”. Google Scholar tells me the paper has been cited more than 4500 times. The theorem can be paraphrased as follows:
If you cannot tell the difference between a network site or process that has failed and one that is just slow, then you can’t tell the difference between a network site or process that has failed and one that is just slow
One might respond: “surely this trivial tautology can’t be a famous result, it’s more probable that you are being hyperbolic” and that response would be right about the probabilities, but – well, here is the problem statement as set down in the paper:
The problem is for all the data manager processes that have participated in the processing of a particular transaction to agree on whether to install the transaction’s results in the database or to discard them. The latter action might be necessary, for example, if some data managers were, for any reason, unable to carry out the required transaction processing. Whatever decision is made, all data managers must make the same decision in order to preserve the consistency of the database.
A set of data manager processes must come to a consensus about whether to commit or to discard. The problem statement requires that ALL of the processes agree on either “yes” or “no”, and, presumably, a single “no” vote must persuade the others. An implicit but key property of the desired consensus is that if a process fails, the others can ignore its opinion. That is, a dead process does not count in the consensus. And that’s the core problem here: the processes consult with each other and possibly all agree on “yes” except for one process that does not answer. Is it a slow process that will say “yes”, is it a slow dissenter that will say “no”, or has it crashed so that its opinion can be ignored? This is a real and interesting problem – consider what happens if a router crashes and comes back up 20 minutes later, after 99 processes agreed to commit a transaction, and suddenly the 100th process is back on line objecting.
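The dilemma in that last scenario can be sketched concretely. Below is a toy Python sketch (not from the paper; the three-voter setup, delays, and timeouts are invented for illustration): a coordinator collects commit votes, and whatever rule it uses to stop waiting for the silent participant is a guess about whether that participant is dead or slow.

```python
import queue
import threading
import time

def participant(vote, delay, inbox):
    """A data manager that takes `delay` seconds to cast its vote."""
    time.sleep(delay)
    inbox.put(vote)

def run_round(timeout):
    """One commit round: two fast 'yes' voters, one slow 'no' voter (0.5 s).

    The coordinator waits up to `timeout` seconds per vote and, following
    the "a dead process does not count" rule, decides on the votes it has.
    """
    inbox = queue.Queue()
    for vote, delay in [(True, 0.0), (True, 0.0), (False, 0.5)]:
        threading.Thread(target=participant,
                         args=(vote, delay, inbox), daemon=True).start()
    votes = []
    for _ in range(3):
        try:
            votes.append(inbox.get(timeout=timeout))
        except queue.Empty:
            break  # presumed dead -- but maybe just slow
    return all(votes)  # commit iff no 'no' vote was heard

print(run_round(timeout=0.1))  # True: the slow dissenter looks dead, so commit
print(run_round(timeout=2.0))  # False: waited long enough to hear the 'no'
```

The two calls make the same observation FLP formalizes: at the moment the short timeout fires, nothing in the coordinator’s local state distinguishes the run where the dissenter has crashed from the run where it is about to vote “no”.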
In this paper, we show the surprising result [my bold] that no completely asynchronous consensus protocol can tolerate even a single unannounced process death. We do not consider Byzantine failures, and we assume that the message system is reliable: it delivers all messages correctly and exactly once. Nevertheless, even with these assumptions, the stopping of a single process at an inopportune time can cause any distributed commit protocol to fail to reach agreement.
It is “surprising” that a consensus algorithm can fail if a process dies “unannounced”. Surprising because the participants in the consensus algorithm can … well, they can’t
Finally, we do not postulate the ability to detect the death of a process, so it is impossible for one process to tell whether another has died (stopped entirely) or is just running very slowly.
Here’s the key phrase: “it is impossible for one process to tell whether another has died (stopped entirely) or is just running very slowly.” So the theorem shows that, given these conditions, it is impossible for the algorithm to tell the difference between a participant that is merely very slow and one that has failed and will never answer. Let’s back up. First, processes are not allowed to time out the consensus algorithm – it has to run until a decision is made.
We also assume that processes do not have access to synchronized clocks, so algorithms based on time-outs, for example, cannot be used.
Ok. Tautology.
If there is a lesson in FLP it is that you can’t do distributed consensus without timeouts – implicit or explicit. But wasn’t that obvious?
From a systems engineering perspective, we don’t care whether a process has crashed or not if it won’t or can’t participate in the consensus process. The obvious solution is to have something like a heartbeat message, so that when the slow process rejoins the network it knows it has probably been ruled out of the consensus group and can take part in some sort of catch-up protocol. This is actually a pretty simple and durable test: a process that has not been able to participate in the protocol for some period of time should consider itself suspect until it can rejoin the protocol, and waiting past that period without interaction is all the evidence the remaining processes need to conclude this process is faulty (see also the CAP principle, also sometimes unfortunately called a theorem).

There are also reasons to look for semantic solutions in addition to time-based solutions. FLP do not consider semantics and rule out timing. What remains is, as they note, an environment with unsolvable reliability problems. The big mysteries that remain, however, are why this result is considered so surprising and why, having noticed the problem, researchers spent so much effort attempting to produce completely asynchronous methods.
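The heartbeat test described above can be sketched in a few lines (illustrative Python; the class and parameter names are mine, not from any paper): each process records when it last heard from each peer and suspects any peer that has been silent for longer than an agreed period.

```python
import time

class HeartbeatMonitor:
    """Suspect any peer silent for longer than `suspect_after` seconds.

    The `now` parameters let the example run deterministically; in real
    use you would omit them and rely on the monotonic clock.
    """
    def __init__(self, suspect_after):
        self.suspect_after = suspect_after
        self.last_seen = {}

    def heartbeat(self, peer, now=None):
        self.last_seen[peer] = time.monotonic() if now is None else now

    def suspects(self, now=None):
        now = time.monotonic() if now is None else now
        return {p for p, t in self.last_seen.items()
                if now - t > self.suspect_after}

m = HeartbeatMonitor(suspect_after=5.0)
m.heartbeat("A", now=0.0)
m.heartbeat("B", now=0.0)
m.heartbeat("A", now=4.0)   # A keeps beating; B falls silent
print(m.suspects(now=7.0))  # {'B'}: B is suspect -- not proven dead
```

Note the symmetry that makes the test durable: by the same clock and the same period, the silent process itself knows it should consider its own membership suspect until it rejoins and catches up.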
(edited Sept 3 2016 and May 2018)
The paper adds in passing: “(In particular, the solutions in [6] are not applicable.)” That is, process A cannot assume process B will either send a response within t seconds or fail, and then use its own clock to see whether t seconds have passed – it cannot use a timeout to conclude that something is wrong. So what can process A do?
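The deadline pattern that assumption rules out is, once clocks are allowed, the standard engineering answer. A minimal Python sketch (the names and the message queue standing in for the network are illustrative):

```python
import queue

def await_response(inbox, t):
    """Wait up to `t` seconds on A's own clock for B's response.

    If the deadline passes, A still cannot know whether B failed or is
    merely slow -- it can only choose to act as if B failed.
    """
    try:
        return inbox.get(timeout=t)
    except queue.Empty:
        return None  # B presumed failed -- or just slow

inbox = queue.Queue()
print(await_response(inbox, 0.1))  # None: no answer within t seconds
inbox.put("commit")
print(await_response(inbox, 0.1))  # commit: B answered in time
```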
