The failure of the cooling
system at the Fukushima Daiichi nuclear plant, one of Japan's aging
nuclear plants (built during the 1970s), after the recent
earthquake illustrates some of the considerations in designing
backup and failsafe systems. We want a system to be "fail safe" because
there is a potential for harm or injury if the system does not
perform as intended.
Typically, there are two schools of
thought on systems failure. One is to determine the physics of
failure of a system, model it mathematically, and then design
the system so that it is robust. This makes sense for single
components, but becomes difficult if you are dealing with complex
systems (e.g. airplanes, automobiles, nuclear reactors) that are
built from many different components - often coming from different
suppliers.
So a Boeing 747 may have engines from GE, Pratt
and Whitney or Rolls Royce, and a Toyota may have an air conditioning
compressor from Sanden and tires from Bridgestone. In these
cases, it would be almost impossible to model everything, so
the other school of thought is to design the system with enough
engineering safety margin and redundancy to perform adequately under
assumed worst case scenarios, and to use statistics (e.g. Weibull and
other distributions) to model failure rates and predict safety
margins.
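To make the statistical approach a little more concrete, here is a minimal Python sketch of the two-parameter Weibull model, where the chance of surviving past time t is exp(-(t/scale)^shape). The pump, its shape value and its scale value are made-up numbers for illustration only, not data from any real component.

```python
import math


def weibull_reliability(t, shape, scale):
    """Probability that a component survives past time t (two-parameter Weibull)."""
    if t < 0:
        raise ValueError("t must be non-negative")
    return math.exp(-((t / scale) ** shape))


def weibull_hazard(t, shape, scale):
    """Instantaneous failure rate at time t, for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)


# Hypothetical pump: shape > 1 means it wears out, so the failure rate rises with age.
SHAPE = 2.0
SCALE = 10_000.0  # characteristic life in hours (assumed)

for hours in (1_000, 5_000, 10_000):
    survival = weibull_reliability(hours, SHAPE, SCALE)
    print(f"{hours:>6} h  survival probability {survival:.4f}")
```

In practice the shape and scale would be fitted to test or field failure data rather than picked by hand, and the fitted curve is then used to decide how long a part can stay in service.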
Test engineers often stress
components before putting them into critical systems,
verify the point at which they fail, and then tell
users to operate the component well below that point. So if a motor
oil is rated to last 15,000 km, you can assume that the test
engineers ran many cars until the oil broke down, possibly at
20,000 km or more, giving the guaranteed figure some margin of safety.
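The arithmetic behind the motor oil example can be sketched in a few lines of Python. The 15,000 km and 20,000 km figures come from the example above; the function names are just illustrative.

```python
def safety_factor(tested_failure_point, rated_limit):
    """Ratio of where the part failed in testing to the limit given to users."""
    if rated_limit <= 0:
        raise ValueError("rated_limit must be positive")
    return tested_failure_point / rated_limit


def margin_percent(tested_failure_point, rated_limit):
    """How far beyond the rated limit the part lasted in testing, in percent."""
    return (safety_factor(tested_failure_point, rated_limit) - 1.0) * 100.0


RATED_KM = 15_000               # figure guaranteed to the user
OBSERVED_BREAKDOWN_KM = 20_000  # where the oil broke down in testing

print(round(safety_factor(OBSERVED_BREAKDOWN_KM, RATED_KM), 2))   # 1.33
print(round(margin_percent(OBSERVED_BREAKDOWN_KM, RATED_KM), 1))  # 33.3
```

A safety factor of about 1.33 means the oil lasted roughly a third longer in testing than the figure the user is told to rely on.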
Aside from the safety and reliability
factors built into systems by the design engineers, the operating
engineers also try to make improvements even when the systems are
already operational. These can include changes made as the years
go by and as new technologies are developed that are better or more
reliable than the older ones. Techniques that operating engineers can
use include Fault Tree Analysis (FTA) and Failure Mode and Effects
Analysis (FMEA). These are basically systematic discussion and mind
mapping tools to allow engineers to share and discuss potential
problems openly and propose changes. There is always a conflict
between the engineers who wish to make safety changes, and management
who often have to weigh the cost versus benefit of these changes. But
it is always good to have your imaginative and creative thinking hats
on when doing these activities.
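One common way to turn an FMEA discussion into a ranked to-do list is the Risk Priority Number: each failure mode is scored from 1 to 10 for severity, occurrence and difficulty of detection, and the three scores are multiplied. The Python sketch below assumes that scoring scheme; the failure modes and their scores are invented purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    name: str
    severity: int    # 1 (negligible) to 10 (catastrophic)
    occurrence: int  # 1 (rare) to 10 (frequent)
    detection: int   # 1 (almost always caught) to 10 (almost never caught)

    def rpn(self) -> int:
        """Risk Priority Number: severity x occurrence x detection."""
        for score in (self.severity, self.occurrence, self.detection):
            if not 1 <= score <= 10:
                raise ValueError("scores must be between 1 and 10")
        return self.severity * self.occurrence * self.detection


def rank_by_rpn(modes):
    """Return the failure modes with the highest RPN first."""
    return sorted(modes, key=lambda mode: mode.rpn(), reverse=True)


# Hypothetical worksheet entries, not taken from any real plant.
HYPOTHETICAL_MODES = [
    FailureMode("coolant pump loses power", 10, 3, 2),
    FailureMode("valve position sensor drifts", 6, 5, 7),
    FailureMode("backup generator fails to start", 9, 2, 4),
]

for mode in rank_by_rpn(HYPOTHETICAL_MODES):
    print(mode.rpn(), mode.name)
```

The ranking is only a starting point for the discussion. A low-scoring item with catastrophic severity may still deserve attention before a high-scoring nuisance.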
For example, instead of needing
to push control rods into a reactor manually when there is an emergency,
some systems let them drop in by gravity. In the case of the
Japanese reactors, motors and pumps had to be on to keep the water
going. Another possible improvement is to have the coolant
flow by gravity because the valves open automatically when power is
lost. In this way, a system is designed to fail safely.
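The difference between a valve that needs power to act and one that fails safely can be shown with a toy model in Python. Both valve classes below are hypothetical simplifications made up for this sketch, not descriptions of any actual reactor hardware.

```python
from enum import Enum


class ValveState(Enum):
    CLOSED = "closed"
    OPEN = "open"


class MotorOperatedValve:
    """Hypothetical valve that needs power to move and stays put without it."""

    def __init__(self):
        self.powered = True
        self.state = ValveState.CLOSED

    def command(self, target):
        # Without power the motor cannot move the valve, so the command is ignored.
        if self.powered:
            self.state = target
        return self.state


class FailOpenValve:
    """Hypothetical valve held shut by a powered actuator; a spring opens it on power loss."""

    def __init__(self):
        self.powered = True
        self._commanded = ValveState.CLOSED

    def command(self, target):
        self._commanded = target
        return self.state

    @property
    def state(self):
        # Loss of power overrides whatever was commanded: the valve springs open.
        return self._commanded if self.powered else ValveState.OPEN


def state_after_blackout(valve):
    """Cut power, then try to open the valve, and report where it ends up."""
    valve.powered = False
    valve.command(ValveState.OPEN)
    return valve.state


for valve in (MotorOperatedValve(), FailOpenValve()):
    print(type(valve).__name__, state_after_blackout(valve).value)
```

The motor operated valve stays closed during the blackout, while the fail-open valve ends up open without anyone having to do anything, which is the behavior we want for emergency coolant.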
One important thing that design and
operational engineers need to spot is the danger of a single point
of failure. This is when a single component (e.g. a screw, a
motor, a bearing, etc.) is the only thing separating
safety from disaster. The obvious single points of failure are easy
to spot; the less obvious ones need to be worked on. If engineers
know that a particular component could be a single point of failure,
they either build a redundant backup (e.g. an extra post
on a building) or make the component more reliable (e.g. make the
post stronger).
There are two approaches we often take
when it comes to the reliability of components or systems. We speak
of improving the reliability of the component itself (e.g. make
microchips, jet engines, nuclear cooling systems, etc. more reliable)
versus adding redundant (or backup) systems. Think of owning just one
car that hardly ever breaks down versus having two less
reliable cars, knowing that both are unlikely to
break down at the same time. This was the tradeoff
that Boeing engineers weighed when they put only two engines on
their 777 model, instead of four engines as on the 747. However,
the individual engine reliability in the 777 is extremely high.
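The one-good-car versus two-ordinary-cars tradeoff can be put into numbers with a short Python sketch. It assumes the cars fail independently of each other, which is a big assumption, and the reliability figures are made up.

```python
def redundant_reliability(unit_reliability, units):
    """Chance that at least one of `units` identical, independent units works."""
    return 1.0 - (1.0 - unit_reliability) ** units


def units_needed(unit_reliability, target):
    """Smallest number of redundant units whose combined reliability meets the target."""
    if not 0.0 < unit_reliability < 1.0 or not 0.0 < target < 1.0:
        raise ValueError("reliabilities must be strictly between 0 and 1")
    units = 1
    while redundant_reliability(unit_reliability, units) < target:
        units += 1
    return units


# Made-up numbers: one very good car versus ordinary cars that start 90% of the time.
ONE_GOOD_CAR = 0.995
ORDINARY_CAR = 0.90

print(round(redundant_reliability(ORDINARY_CAR, 2), 4))  # two ordinary cars together
print(units_needed(ORDINARY_CAR, ONE_GOOD_CAR))          # ordinary cars needed to match the good one
```

With these numbers two ordinary cars get you to about 99%, and it takes a third to beat the single very reliable car. Whether that is cheaper than buying the better car is the kind of question the engineers have to answer.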
Things to remember: if you have two or
more components operating in parallel (e.g. you have two cars, or two
houses), the reliability of the combined system is greater than each
component taken individually. So if you have two houses, and one
falls down in an earthquake, you still have another house you can
move into. Of course, redundancy is always expensive. Another thing
to remember is that if you have two or more components operating
in series (e.g. to get to work you need to take the train and then the
airplane), the total reliability is less than that of each component.
If one of the components fails, the entire system automatically fails
because they are in series.
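The two rules of thumb above correspond to two simple formulas, sketched here in Python under the assumption that the components fail independently. The reliability figures for the train, the airplane and the houses are invented for illustration.

```python
from math import prod


def series_reliability(reliabilities):
    """All components must work: multiply the individual reliabilities."""
    return prod(reliabilities)


def parallel_reliability(reliabilities):
    """Only one component must work: one minus the chance that all of them fail."""
    return 1.0 - prod(1.0 - r for r in reliabilities)


# Assumed figures: the commute needs both the train (0.95) and the airplane (0.90).
commute = series_reliability([0.95, 0.90])

# Assumed figures: two houses, each with a 0.90 chance of surviving the earthquake.
shelter = parallel_reliability([0.90, 0.90])

print(round(commute, 3))  # lower than either leg of the trip
print(round(shelter, 3))  # higher than either house alone
```

The series result is always at or below the weakest component, and the parallel result is always at or above the strongest one.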
There are also what are called k out of
n systems. For example, if the Titanic had hit the iceberg head on,
only the front bulkhead would have been damaged. If it had eight
bulkheads, and one was damaged, it could have reached port with 7 out
of 8 bulkheads intact.
However, because the sailor on watch
was looking at Leonardo DiCaprio and Kate Winslet, the Titanic
veered too late and the iceberg sliced through the side of the ship,
damaging several bulkheads in the process.
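A k out of n system can also be put into numbers. The Python sketch below treats the n units as identical and independent, which is a simplification (damage from one iceberg is hardly independent across neighboring bulkheads), and the 0.95 survival figure is made up.

```python
from math import comb


def k_out_of_n_reliability(k, n, unit_reliability):
    """Chance that at least k of n identical, independent units keep working."""
    if not 0 <= k <= n:
        raise ValueError("k must be between 0 and n")
    p = unit_reliability
    # Sum the binomial probabilities of exactly i working units, for i = k..n.
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical ship: stays afloat if at least 7 of its 8 bulkheads hold,
# and each bulkhead is assumed to hold with probability 0.95.
print(round(k_out_of_n_reliability(7, 8, 0.95), 4))
```

Setting k equal to n gives the series case, and setting k to 1 gives the parallel case, so this is the general form of both.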
When disaster doesn't strike, it is
often thought to be a statistically improbable scenario - until it
happens. But if we begin to take all scenarios seriously, even highly
unlikely ones, we will end up with very impractical and expensive
systems.
Striking a balance, one that does not
compromise human health and safety but does not end up with a
structurally engineered doghouse either, is in everyone's interest.
Of course, if an unforeseen disaster
strikes, and our favorite dog lies crushed in the rubble, we all wish
we had spent the little extra time and money to make the system a
little bit safer and stronger.
Dennis Posadas is the author of Jump
Start: A Technopreneurship Fable (Singapore: Pearson Prentice Hall,
2009). His latest ebook, Green Thinking fable
(http://greenthinkingfable.blogspot.com),
deals with clean energy. At one time in his
professional career, he managed equipment system reliability in the
semiconductor industry.