There are typically two schools of thought on systems failure. One is to determine the physics of failure of a system, model it mathematically, and then design the system to be robust. This makes sense for single components, but becomes difficult when you are dealing with complex systems (e.g. airplanes, automobiles, nuclear reactors) that are built from many different components, often coming from different manufacturers. A Boeing 747 may have engines from GE, Pratt & Whitney or Rolls-Royce, and a Toyota may have an air-conditioning compressor from Sanden and tires from Bridgestone. In these cases, it would be almost impossible to model everything, so the other school of thought is to design the system with enough engineering safety margin and redundancy to perform adequately under assumed worst-case scenarios, and to use statistics (e.g. Weibull and other distributions) to model failure rates and predict safety margins.
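To make the statistical approach a bit more concrete, here is a minimal sketch of the two-parameter Weibull model in Python. The function names and the shape and scale values are illustrative assumptions, not figures from any real component.

```python
import math

def weibull_reliability(t, shape, scale):
    """Probability that a component is still working at time t."""
    return math.exp(-((t / scale) ** shape))

def weibull_hazard(t, shape, scale):
    """Instantaneous failure rate at time t (for t > 0)."""
    return (shape / scale) * (t / scale) ** (shape - 1)

# Hypothetical component: shape > 1 means the failure rate rises with age
# (wear-out), and scale is the characteristic life in hours.
shape, scale = 2.0, 1000.0
for hours in (250.0, 500.0, 1000.0):
    r = weibull_reliability(hours, shape, scale)
    h = weibull_hazard(hours, shape, scale)
    print(f"t={hours:6.0f} h  reliability={r:.3f}  hazard={h:.5f} per hour")
```

A shape of 1 gives a constant failure rate, a shape below 1 describes early-life failures, and a shape above 1 describes wear-out. In practice the two parameters would be fitted to test or field data rather than chosen by hand.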
Test engineers often stress existing components before putting them into critical systems, verify the point at which they fail, and then tell users to operate the component well below that point. So if a motor oil is rated to last 15,000 km, you can assume that the test engineers ran many cars until the oil broke down, possibly at 20,000 km or more, which gives the guaranteed figure some margin of safety.
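The arithmetic behind the motor oil example can be sketched in a few lines of Python. The function names and the list of test results are hypothetical, and taking the earliest observed failure as the reference point is a simplifying assumption; real test programs typically work from a statistical lower bound.

```python
def safety_factor(failure_point_km, rated_limit_km):
    """Ratio of the point where failure was observed to the rated limit."""
    if rated_limit_km <= 0:
        raise ValueError("rated limit must be positive")
    return failure_point_km / rated_limit_km

def rated_limit(observed_failures_km, factor):
    """Derive a rated limit from the earliest observed failure (assumption)."""
    return min(observed_failures_km) / factor

# Hypothetical test results: distance at which the oil broke down in each car.
observed = [20_000, 21_500, 23_000]

print(round(safety_factor(20_000, 15_000), 2))   # breakdown at 20,000 km, rated 15,000 km
print(round(rated_limit(observed, 4 / 3)))       # rating that keeps the same factor
```

With a breakdown point of 20,000 km and a rating of 15,000 km, the safety factor is about 1.33, meaning the oil is expected to last roughly a third longer than the guarantee.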
Aside from the safety and reliability factors built into systems by the design engineers, the operating engineers also try to make improvements while the systems are already in operation. These can include changes made over the years as new technologies are developed that are better or more reliable than the older ones. Techniques that operating engineers can use include Fault Tree Analysis (FTA) and Failure Mode and Effects Analysis (FMEA). These are basically systematic discussion and mind mapping tools that allow engineers to share and discuss potential problems openly and propose changes. There is always a conflict between the engineers who wish to make safety changes and the managers who have to weigh the costs of these changes against their benefits. But it is always good to have your imaginative and creative thinking hats on when doing these activities.
For example, instead of requiring someone to manually push control rods into a reactor in an emergency, some systems let them drop by gravity. In the case of the Japanese reactors, motors and pumps had to be running to keep the cooling water flowing. Another possible improvement is to have the coolant flow by gravity through valves that open when power is lost. In this way, a system is designed to fail safely.
One important thing that design and operating engineers need to spot is the danger of a single point of failure. This is when a single component (e.g. a screw, a motor, a bearing) is the only thing standing between safety and disaster. The obvious single points of failure are easy to spot; the less obvious ones take work to find. If engineers know that a particular component could be a single point of failure, they either build a redundant backup (e.g. an extra post on a building) or make the component more reliable (e.g. make the post stronger).
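One simple way to hunt for the less obvious single points of failure is to describe the system as a rule saying when it works, then fail each component on its own and see whether the system survives. The sketch below does this in Python for a hypothetical cooling system; the component names and the `cooling_works` rule are invented for illustration.

```python
def single_points_of_failure(components, system_works):
    """Return the components whose failure alone brings the system down."""
    spofs = []
    for failed in components:
        # Everything is healthy except the one component under test.
        state = {name: True for name in components}
        state[failed] = False
        if not system_works(state):
            spofs.append(failed)
    return spofs

# Hypothetical system: two redundant pumps fed by one shared power supply.
components = ["pump_a", "pump_b", "power_supply"]

def cooling_works(state):
    return state["power_supply"] and (state["pump_a"] or state["pump_b"])

print(single_points_of_failure(components, cooling_works))  # ['power_supply']
```

The pumps back each other up, so neither is flagged, but the shared power supply is: redundancy in one place does not help if everything still depends on a single component elsewhere.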
There are two approaches we often take when it comes to the reliability of components or systems. We speak of improving the reliability of the component itself (e.g. making microchips, jet engines or nuclear cooling systems more reliable) versus adding redundant (or backup) systems. Think of owning just one car that never (or hardly ever) breaks down versus having two less reliable cars, knowing that both are unlikely to break down at the same time. This was the trade-off that Boeing engineers weighed when they put only two engines on the 777 instead of four as on the 747. In return, the reliability of each individual 777 engine is extremely high.
Things to remember: if you have two or more components operating in parallel (e.g. you have two cars, or two houses), the reliability of the combined system is greater than that of each component taken individually. So if you have two houses, and one falls down in an earthquake, you still have another house you can move into. Of course, redundancy is always expensive. Another thing to remember is that if you have two or more components operating in series (e.g. to get to work you need to take both a train and an airplane), the total reliability is less than that of each component. If one of the components fails, the entire system fails because they are in series.
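The series and parallel rules above can be put into numbers with a short Python sketch. It assumes that components fail independently of each other, which is a simplification, and the reliability figures for the cars, the train and the airplane are made up for illustration.

```python
from math import prod

def series_reliability(reliabilities):
    """All components must work: multiply the individual reliabilities."""
    return prod(reliabilities)

def parallel_reliability(reliabilities):
    """At least one component must work: 1 minus the chance that all fail."""
    return 1.0 - prod(1.0 - r for r in reliabilities)

# Hypothetical numbers: two cars that each start 90% of the time.
two_cars = parallel_reliability([0.90, 0.90])

# Hypothetical commute: train on time 95% of the time, airplane 98%.
commute = series_reliability([0.95, 0.98])

print(f"two 90% cars in parallel: {two_cars:.3f}")
print(f"train and airplane in series: {commute:.3f}")
```

Two 90% cars together come out at about 99%, matching a single car that is 99% reliable on its own, while the commute in series comes out at about 93%, lower than either leg by itself.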
There are also what are called k-out-of-n systems, which keep working as long as at least k of their n components are working. For example, if the Titanic had hit the iceberg head-on, only the front bulkhead would have been damaged. If it had eight bulkheads, and one was damaged, it could have reached port with 7 out of 8 bulkheads intact.
However, because the sailor on watch was looking at Leonardo DiCaprio and Kate Winslet, the Titanic veered too late and the iceberg sliced through the side of the ship, damaging several bulkheads in the process.
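The k-out-of-n idea can be sketched with the binomial formula. This Python example assumes n identical components that each survive independently with the same probability p; the 95% figure per bulkhead is purely hypothetical, and the iceberg story shows why independence is a strong assumption, since one event can damage several components at once.

```python
from math import comb

def k_out_of_n_reliability(k, n, p):
    """Probability that at least k of n independent components survive,
    where each component survives with probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Hypothetical ship: 8 bulkheads, each 95% likely to stay intact,
# and the ship stays afloat if at least 7 of them hold.
print(f"7-out-of-8: {k_out_of_n_reliability(7, 8, 0.95):.3f}")

# Special cases: 1-out-of-n is a parallel system, n-out-of-n is a series system.
print(f"1-out-of-2: {k_out_of_n_reliability(1, 2, 0.90):.3f}")
print(f"2-out-of-2: {k_out_of_n_reliability(2, 2, 0.90):.3f}")
```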
When disaster doesn't strike, it is often thought to be a statistically improbable scenario, until it happens. But if we begin to take all scenarios seriously, even highly unlikely ones, we will end up with very impractical and expensive systems.
Striking a balance, one that does not compromise human health and safety but also does not leave us with a structurally engineered doghouse, is in everyone's interest.
Of course, if an unforeseen disaster strikes, and our favorite dog lies crushed in the rubble, we all wish we had spent a little extra time and money to make the system a little bit safer and stronger.
Dennis Posadas is the author of Jump Start: A Technopreneurship Fable (Singapore: Pearson Prentice Hall, 2009). His latest ebook, Green Thinking fable (http://greenthinkingfable.blogspot.com), deals with clean energy. Earlier in his career, he managed equipment system reliability in the semiconductor industry.