ULS Workshop Wiki


FailureIsConstant

It's not a matter of what to do when the system fails. Given that "hundreds of components are failing all the time", how can we design and implement software to live well with that? -FabioKon

I can think of two examples of systems that already do it. One is the World Wide Web. One of the reasons it works so well is that it never assumes that a URL that worked a minute ago will work now. Failure is built into the communication model. Moreover, the system really DOES fail, so anyone who mistakenly builds a system on the assumption that communication is reliable is quickly set straight. Because the web is like this, web services tend to be like this as well. When you build a system on web services, you know the odds that your components will fail are high. Contrast this with Unix programs, where many C programmers assume that a write will always succeed and do not check for errors. They get away with it because file systems don't fill up very often, and a full file system is the main reason a write fails unexpectedly in Unix. This shows the importance of a pattern I'll call CreateFailure: if failures really happen, and happen often, nobody can get away with ignoring them.
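To make the Unix point concrete, here is a minimal sketch of a write that treats failure as an ordinary outcome. It is in Python rather than C, and write_all is a hypothetical helper invented for this page, not part of any library. It assumes a Unix-like system and a raw file descriptor.

```python
import os
from typing import Tuple


def write_all(fd: int, data: bytes) -> Tuple[bool, str]:
    """Write all of data to fd and report failure instead of assuming success.

    Hypothetical helper for illustration only.
    """
    view = memoryview(data)
    while len(view) > 0:
        try:
            # os.write may write fewer bytes than requested, so keep looping.
            written = os.write(fd, view)
        except OSError as exc:
            # A full disk, a closed descriptor, a broken pipe: all end up here.
            return False, "write failed: " + str(exc.strerror)
        view = view[written:]
    return True, "ok"
```

The caller has to look at the result before carrying on, which is exactly the step that gets skipped when writes almost never fail.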

Another example is Erlang. One of the reasons it is so good at building reliable systems is that failure is a key part of Erlang interprocess communication. One of the expected results of every message is "receiver died". Processes are independent of each other, so failure propagates only when you want it to. The possibility of failure and the independence of processes are built into the language. Joe Armstrong (one of the creators of Erlang) has a whole bunch of patterns that are used to limit failure, and I'd like to learn them. I am not proposing Erlang as a solution to ULS, because no programming language will be. However, its model of interprocess communication is similar to that of web services and is just what is needed in a ULS. I bet that many of Joe Armstrong's patterns would be important in a ULS as well.
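A rough sketch of the "receiver died" idea, written in Python with threads and queues. This is not Erlang and not how Erlang implements it. IsolatedProcess, call, and RECEIVER_DIED are names made up for this illustration, and the sketch assumes that a timeout can be treated the same as a dead receiver.

```python
import queue
import threading

RECEIVER_DIED = "receiver died"


class IsolatedProcess:
    """Hypothetical worker with its own mailbox; a crash stays inside it."""

    def __init__(self, handler):
        self._handler = handler
        self._mailbox = queue.Queue()
        self._dead = threading.Event()
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            message, reply_box = self._mailbox.get()
            try:
                reply_box.put(("ok", self._handler(message)))
            except Exception:
                # Mark the process dead before replying, so later callers
                # see the death immediately instead of waiting for a timeout.
                self._dead.set()
                reply_box.put((RECEIVER_DIED, None))
                return

    def call(self, message, timeout=1.0):
        """Send a message; "receiver died" is one of the normal results."""
        if self._dead.is_set():
            return (RECEIVER_DIED, None)
        reply_box = queue.Queue()
        self._mailbox.put((message, reply_box))
        try:
            return reply_box.get(timeout=timeout)
        except queue.Empty:
            # Assumption: no answer in time is treated like a dead receiver.
            return (RECEIVER_DIED, None)
```

For example, with divider = IsolatedProcess(lambda n: 10 // n), divider.call(2) returns ("ok", 5), while divider.call(0) and every call after it return ("receiver died", None). A second IsolatedProcess keeps answering as before, because the failure does not propagate unless the caller chooses to act on it.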

-RalphJohnson