Any decently designed API will have ways of indicating failures, but often programmers don't use them. How can we ensure that programmers build systems that tolerate failure?
Programmers don't check for failure because checking is more work than ignoring failure, and because initial versions of a program work fine without it. Components are fairly reliable, and systems have to get pretty big before the unreliability of the components becomes noticeable. Programmers check for failure when they expect it: when they doubt that their program will work without the check. Programmers might read a Unix file without checking for errors, because they know they put the file there and think that nothing can go wrong. But web programmers reading a URL will almost certainly check for an error, because they know that the web server might not have been started, the server might have gone down, or the network might have gone down. URLs are extremely unreliable. Everybody knows it, and that is why the web is so reliable.
Therefore, make sure that components fail regularly. A program that does not check for failure should not work.
One way to have components fail regularly is just to use a real system in which components fail regularly. This does not take any foresight, though most programmers working with such a system do not recognize it as a blessing. Alternatively, component failure can be simulated during testing.
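Simulating failure during testing might look something like the sketch below. It wraps an ordinary operation so that it fails with a chosen probability. The names `FlakyComponent`, `ComponentFailure`, `read_config`, and `load_with_fallback` are all made up for illustration; this is one possible shape, not a reference to any particular tool.

```python
import random


class ComponentFailure(Exception):
    """Raised when the simulated component fails (hypothetical name)."""


class FlakyComponent:
    """Wraps a callable so that it fails with a given probability."""

    def __init__(self, operation, failure_rate, seed=None):
        if not 0.0 <= failure_rate <= 1.0:
            raise ValueError("failure_rate must be between 0 and 1")
        self._operation = operation
        self._failure_rate = failure_rate
        # A seeded generator keeps test runs repeatable.
        self._rng = random.Random(seed)

    def __call__(self, *args, **kwargs):
        # random() returns a value in [0, 1), so a rate of 1.0 always fails
        # and a rate of 0.0 never fails.
        if self._rng.random() < self._failure_rate:
            raise ComponentFailure("simulated failure")
        return self._operation(*args, **kwargs)


def read_config():
    # Stand-in for an operation that "cannot go wrong", such as a local read.
    return {"retries": 3}


def load_with_fallback(component, default):
    # A caller that does check for failure and degrades gracefully.
    try:
        return component()
    except ComponentFailure:
        return default
```

A caller that invokes the wrapped component without a `try` breaks as soon as the failure rate is above zero, which is the point: code that does not check for failure stops working during testing rather than in production.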
I agree with Ralph on that. What matters, then, is that the testbeds we use for experimentation with ULSs imitate the failures that will happen in a real environment. In addition, I'd like to be able to write automated tests that check whether the system works in the presence of different failure patterns and increasing degrees of failure.
- FabioKon
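An automated test over increasing degrees of failure could be sketched roughly as follows. Everything here is hypothetical: `make_flaky_fetch` stands in for a component of the testbed, `fetch_with_retry` for the system under test, and the failure rates and trial counts are arbitrary choices. Only independent random failures are modelled; other patterns, such as bursts, would need their own generator.

```python
import random


class ComponentFailure(Exception):
    """Raised by the simulated component (hypothetical name)."""


def make_flaky_fetch(failure_rate, rng):
    """Return a fetch function that fails independently at the given rate."""
    def fetch():
        if rng.random() < failure_rate:
            raise ComponentFailure("simulated failure")
        return "payload"
    return fetch


def fetch_with_retry(fetch, attempts):
    """System under test: retry a few times, give up with None."""
    for _ in range(attempts):
        try:
            return fetch()
        except ComponentFailure:
            continue
    return None


def success_ratio(failure_rate, attempts, trials=1000, seed=0):
    """Fraction of trials in which the system still produced a result."""
    rng = random.Random(seed)  # seeded so the test is repeatable
    fetch = make_flaky_fetch(failure_rate, rng)
    succeeded = sum(
        fetch_with_retry(fetch, attempts) is not None for _ in range(trials)
    )
    return succeeded / trials


def sweep(rates=(0.0, 0.25, 0.5, 0.75, 1.0), attempts=5):
    """Measure the system at increasing degrees of component failure."""
    return {rate: success_ratio(rate, attempts) for rate in rates}
```

A test can then assert a threshold per rate, for example that the system still succeeds in most trials when half of all component calls fail, and that it reports failure cleanly rather than hanging when every call fails.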