Avoidable mistakes are still mistakes, and mistakes happen, even when they shouldn't.

There is no one thing you could do to avoid catastrophic failure. The best thing you can do is break your system up entirely and simply never make any change that can affect the entire system, which is still very hard to do.

Imagine you have a route that crosses all of Canada. For each end of Canada you want at least 3 separate networks that can each carry 1/2 of the traffic. When one network fails, somebody has to take that failed network's traffic and send half to each of the two remaining networks. But if somebody fucks up and sends all of that traffic to just one of the networks, that network will be flooded and inoperable, and the first network will still be inoperable too, leaving only one network online, carrying 1/3 of the country's traffic. Any customers who designed their own stuff to depend on the first or second network are also screwed. This is just one of a million different scenarios that need complex controls to prevent a (mostly) catastrophic failure.
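
To make the arithmetic concrete, here is a rough Python sketch of that scenario. It assumes the three networks normally split traffic evenly (1/3 each), which the example implies but doesn't state outright. The network names and the `redistribute` and `survivors` helpers are made up for illustration, not anything a real carrier runs.

```python
from fractions import Fraction

# From the example: each network can carry at most half of total traffic.
CAPACITY = Fraction(1, 2)

def redistribute(loads, failed, shares):
    """Move the failed network's load onto the others according to `shares`.

    `shares` maps a surviving network name to the fraction of the failed
    network's traffic it receives (hypothetical helper for illustration).
    """
    new_loads = {name: load for name, load in loads.items() if name != failed}
    for name, share in shares.items():
        new_loads[name] += loads[failed] * share
    return new_loads

def survivors(loads):
    """Networks still at or under capacity; anything over it is flooded."""
    return {name: load for name, load in loads.items() if load <= CAPACITY}

# Assumption: in normal operation the three networks share traffic evenly.
normal = {"A": Fraction(1, 3), "B": Fraction(1, 3), "C": Fraction(1, 3)}

# Correct failover: A's traffic is split half and half between B and C.
good = redistribute(normal, "A", {"B": Fraction(1, 2), "C": Fraction(1, 2)})

# The mistake: all of A's traffic is sent to B.
bad = redistribute(normal, "A", {"B": Fraction(1), "C": Fraction(0)})

print(survivors(good))  # B and C both sit exactly at capacity (1/2 each)
print(survivors(bad))   # only C is left, carrying 1/3 of the traffic
```

Even the correct failover leaves both remaining networks at exactly 100% of capacity with zero headroom, so the margin for error is about as thin as it gets.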

There is nobody using formal proofs to show the design makes sense, so it's just humans winging it and hoping their contingencies work out.