The basic standard in the industry is that the airplane and its systems should be designed such that catastrophic failure does not occur more than once in a billion flight hours. Consider that standard for a moment: one would have to fly continuously, 24 hours a day, every day, for more than 110,000 years. Yet no jet fleet has demonstrated such freedom from catastrophic accidents. For example, the B737 fleet has amassed nearly 95 million flights, which equate to about one quarter of a billion flight hours - with nearly 50 accidents categorized as full loss equivalents (see http://www.airsafe.com). Another database places the number of hull loss accidents for the B737 fleet at 104 (see http://aviation-safety.net/database/type/103.shtml).
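The arithmetic behind these figures can be checked with a short Python sketch. It uses only the numbers quoted above (one quarter of a billion fleet hours, roughly 50 full loss equivalents, 104 hull losses). It assumes a 365-day year, and it makes the simplifying assumption that every counted accident is set against the design target, although many accidents are not caused by system failures. The variable names are illustrative only.

```python
HOURS_PER_YEAR = 24 * 365  # assumption: 365-day year, leap days ignored

target_rate = 1e-9  # one catastrophic failure per billion flight hours

# Years of continuous flying needed to accumulate a billion flight hours.
years_per_billion_hours = 1e9 / HOURS_PER_YEAR

# Figures quoted in the article for the B737 fleet.
fleet_hours = 0.25e9          # "about one quarter of a billion flight hours"
full_loss_equivalents = 50    # "nearly 50" per airsafe.com
hull_losses = 104             # per aviation-safety.net

# Simplification: all accidents are counted, whatever their cause.
fle_rate = full_loss_equivalents / fleet_hours
hull_loss_rate = hull_losses / fleet_hours

print(f"Years to fly a billion hours: {years_per_billion_hours:,.0f}")
print(f"Full-loss rate per hour: {fle_rate:.2e} ({fle_rate / target_rate:.0f}x target)")
print(f"Hull-loss rate per hour: {hull_loss_rate:.2e} ({hull_loss_rate / target_rate:.0f}x target)")
```

On these inputs the demonstrated loss rate is hundreds of times the one-in-a-billion figure, which is the gap the rest of the article sets out to explain.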
Lu Zuckerman, an experienced reliability and maintainability engineer, explains that the disparity between theoretical safety and the demonstrated level of safety has its roots in the way safety is assessed: system by system, not necessarily for the whole airplane. His discourse is the product of a recent exchange of electronic messages, in which Zuckerman was asked to expound on his thesis:
"The mythical failure rate of 10^-9 (one in a billion) can be addressed two ways. The FARs (Federal Aviation Regulations) require that a single point failure that can contribute to the loss of an aircraft can occur no more frequently than 10^-9 and if at all possible should be designed out. The 10^-9 figure that most people quote does not apply to the aircraft level but, instead, it applies to the system failure that can cause loss of the aircraft. The FARs and JARs (Europe's Joint Aviation Regulations) will specify the acceptable frequency of a system failure with effects ranging from a minor problem to loss of the aircraft. The upper limit is usually 10^-9.
"The analysis relies on the manipulation of numbers. If the regulations specify that a flap or slat system can lock up no more frequently than 10^-6 (one failure in one million hours of exposure), the reliability engineer is forced to use non-realistic failure rates for the individual components that can cause lock-up, as there may be several hundred in the respective system whose failure can result in lock-up. Where do these failure rates come from? Mostly from government-developed databases that contain several hundred items that may or may not be used on aircraft. In some cases the failure rate will have an upper, a median and a lower level of confidence. The analyst is free to pick whatever confidence level best fits the calculation and ultimately arrives at the desired failure rate. If the requirement is 10^-9 for runaway or non-movement when flap/slat operation is commanded, then the search for usable failure numbers becomes even more ridiculous.
"So now after making the reliability calculation using non-realistic numbers the reliability engineer passes them to the systems safety engineer. The systems safety engineer then creates an FTA (fault tree analysis) which is made up of gates, the most common of which are 'AND' gates and 'OR' gates. The diagram is from the top down, meaning that the top gate is the actual failure resulting in breaching the 10^-9 requirement.
"The top gate is connected to the lower gates by connecting lines, or failure paths. The failure paths leading to the top gate come from either the 'AND' gates or 'OR' gates, and each of these gates represents a failure that can lead upward to the breach in the 10^-9 requirement. There can be as many gates of both kinds as needed to reflect the system complexity. There may be as many FTAs as required to reflect all of the services that supply the system, such as hydraulics, electrical and electronics and the hardware elements of the system.
"Imagine the gates as being locks. On an 'OR' gate there can be several failures, each of which is a key to the lock and any one of these failures can pass through that gate. On an 'AND' gate, each of the failures is a key to the lock but all must be present in order for the collective failure to pass through the gate.
"This is a simplification but [is] easy to understand. Let's assume that an 'OR' gate has five failures, each of which can open the lock. The math is (1 x 10-6) + (1 x 10-6) + (1 x 10-6) + (1 x 10-6) + (1 x 10-6), with a result of 5 x 10-6 (five failures in one million hours of exposure).
"Using the same numbers, consider an 'AND' gate. The math is (1 x 10-6) x (1 x 10-6) x (1 x 10-6) x (1 x 10-6) x (1 x 10- 6), with a result of 1 x 10-30.
"These calculations are unrealistic because they bear little, if any, relevance to the operating environment or to actual recorded failures. Let me explain. Because the mean time between failure (MTBF) is dictated by the certification authorities at the system level, the MTBFs for system elements and parts thereof are apportioned downward. This means that failure rates at the piece part level must be selected in order to attain the necessary failure rate at the component level. In this way the ultimate number will meet the MTBF requirement but the original part failure rate has nothing to do with the aircraft application.
"This is not true for electronics because of the millions of histories generated for all types of avionics circuits and components.
"Here is the kicker. The FTAs are for systems and not the aircraft. Each FTA terminates in assessing the probability of failure of the specific system. This process should be carried one step further by making a FTA with an 'OR' gate representing the aircraft, with each of the systems feeding into that gate. Having a final 'OR' gate will provide a truer picture of the catastrophic failure rate at the very top level. Because it is an 'OR' gate, one would most likely come up with a catastrophic loss rate in the area of 1 x 10-8 (one in 100 million hours of exposure) or possibly lower - not 10-9 - which more truly reflects the crash rate of commercial aircraft. People fixate on the 1 x 10-9 failure rate thinking it is at the aircraft level when it is in fact at the system level.
"However, the Federal Aviation Administration does not require this assessment at the aircraft level. So much for safety."
Support for Zuckerman's argument comes from a 2001 presentation at the National Aeronautics and Space Administration's System Safety Center irreverently titled "A Charlatan's Guide to Quickly Acquired Quackery," subtitled "The Trouble With System Safety." The author of this paper, one P. L. Clemens, said, "The hazard inventory techniques ... view risk hazard-by-hazard. If individual hazards pose acceptable risk, system risk is judged acceptable. So, a large inventory of individual hazards can be disguised as a 'safe' system - even though in reality it may portend a grim disaster!"
A timely illustration of a final 'OR' gate to assess safety at the aircraft level is contained in a March 12 joint letter from the Aerospace Industries Association (AIA) and the General Aviation Manufacturers Association (GAMA) to the Aging Transport Systems Rulemaking Advisory Committee (ATSRAC). The letter's authors argued that the ATSRAC effort to define wiring - more precisely the electrical wiring interconnection system, or EWIS - as a separate system "effectively doubles the risk to the fleet." It is an objection to the final "OR" gate Zuckerman recommends for safety assessment at the aircraft level. Zuckerman, e-mail firstname.lastname@example.org
I'm not sure where the confusion comes from, but there has never been a regulatory requirement "that the airplane and its systems should be designed such that catastrophic failure does not occur more than once in a billion flight hours". The regulatory requirements have always addressed individual failure modes.
While it would obviously be safer to apply the one in a billion requirement to the whole aircraft, that is simply not achievable. Let's face it, this is much, much more stringent than the current standard. The aircraft that resulted from such a design standard would cost so much and be so heavy that no airline could make money with it. We might as well shut the industry down - at least that is safe.
I do support the concept of requiring a certain "whole aircraft" level of safety, but one in a billion is way too high a target.
I had to read this a few times. The article starts out referencing a one-in-a-billion standard for "the airplane and its systems", but later, Zuckerman states, "However, the Federal Aviation Administration does not require this assessment at the aircraft level."
As an industry outsider, I am surprised and troubled to learn that standards apply only at the system level. I won't quibble with you, RV8, about what the standard should be, but it sure seems to me that there should be one.
As it's explained in the article, the (regulatory) standards would appear to be way too easy to manipulate. If a "system" is composed of three subsystems, each of which (barely) meets whatever standard is at work, and the failure of any one of them fails the system, then the system as a whole would fail about three times as often as the standard allows. But if each subsystem can be defined as a separate system, the problem goes away. Who decides what comprises a "system"?
The lack of an overall aircraft standard raises another issue, particularly in today's environment, wherein carriers are performing complex modifications like entertainment systems, in-flight computer power systems, internet access etc. Each new system introduces some incremental probability of catastrophic failure, thereby increasing the probability for the entire aircraft. Add enough stuff ...
I don't think it would be prohibitively expensive if there was a baseline whole aircraft standard that at least prevented creeping degradation of failure rates as new systems are added.
From this week's Air Safety Week, a pilot's view on 'Mythical Safety':
The Geometric Curve of Risk
Capt. Paul Miller, Safety Committee Member, Independent Pilots Association
In the area of 'risk' permit me to add a few points from a pilot's point of view to engineer Lu Zuckerman's perspective of 'mythical safety' (see ASW, July 28).
1. Risk is a mathematical product of probability of failure times the severity of the failure. Risk (Z) is equal to probability (X) times severity (Y), or Z = XY. I call this relationship Miller's Safety Formula. Set aside the negative values of X and Y and only consider the positive (+) values of X and Y, in other words only positive probability and positive severity. This results in a positive Risk (Z) value, a very useful segment of the overall product. The resulting risk data lies on a continuous surface, curving upward from the origin like a bent playing card.
The surface is somewhat flat near the origin yet quickly curves upwards. What the curved surface tells us is that risk is a product and that if we allow the product to continue to multiply, it will increase rapidly in a manner similar to a geometric curve.
Z = XY will rapidly increase because the factors X (probability) and Y (severity) are independently variable.
Therefore, in order to reside on the flat part of the risk curve, which I would label the safe part of air operations, we need to rapidly resolve issues of probability and severity as soon as they are discovered. In other words, rapid resolution is as important as the resolution itself!
If we all would like to reside in the flat part of the risk curve in relative safety, then we will have to become much more rapid problem solvers.
2. Risk cannot be pinned solely on probability of failure of a mechanical piece, part or system. As a pilot, I am so glad the Federal Aviation Administration (FAA) is pushing for more reliable parts and systems. On a recent night B767 flight eastbound from Bombay to Hong Kong, I want to say that I felt a big debt of gratitude to all of the engineers who made my great plane! There are not many landing fields amidst the cyclones and miles of open ocean.
By the same token, the National Transportation Safety Board (NTSB) has gone to great lengths to point out that human error (pilot, maintenance and supervisory) is the root cause of air disasters. Even mechanical failures have been traced to human error in manufacture, installation and maintenance. In other words, it is not the broken part that does us in. It is flying with a 'known broken part' repeatedly that finally does us in.
So human error is as much a part of the equation as bench test engineering failure data, if not more. How we pilots handle broken parts, how mechanics are told to defer broken parts, and how operational pressure causes us to operate with broken parts are all in the probability equation mix.
3. If we are to believe the NTSB, then we must acknowledge that human error (read 'human factors') is a critical part of the accident equation and therefore a critical part of the accident prevention equation.
4. What more debilitating human factor issue is there amongst line flight crew than fatigue? In what one area of safety has the FAA been more reluctant to lead than flight crew fatigue? Metal fatigue is given generous research attention by the FAA, but human fatigue is all but ignored and, until recently, its existence was hardly recognized as a causal factor in accidents. It is poorly studied and it is poorly regulated. Line flight crews are left to their own defenses, with contractual language the only bulwark against fatigue-inducing operations of 16-plus-hour days. The FAA has said line flight crews should self-police fatigue, holding the crews themselves responsible for being fatigued and operating fatigued! Human factors are more important in accident prevention than regulators have so far recognized.
5. Probability can only be expressed mathematically as much as it can be measured mathematically. The truth is that it is more often estimated, concluded from averaging data, deduced or even induced in engineering studies. Perhaps the full range of probability and severity should be looked at instead, and presented to management when trying to make a 'go-no go' decision. While the probability data may show a low value, the severity data may be extremely high, causing the risk value to be much higher than the probability would indicate.
6. Risk equals dollars. If you bet big you can lose big, but with today's seat revenues, you really can't win big by operating a heightened risk flight.
So risk is really the measuring yardstick, not probability. (ASW note: Capt. Miller last appeared in this publication Sept. 6, 1999, p. 10, 'A Pilot Perspective on Maintenance & Safety') Miller, e-mail PaulLMiller44@cs.com
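Miller's formula from point 1, Z = XY, and his warning in point 5 about low-probability, high-severity events can be illustrated with a short Python sketch. The two events and their numbers are hypothetical, and severity is expressed in arbitrary cost units chosen only for illustration.

```python
def risk(probability, severity):
    """Miller's formula from the text: risk Z = probability X times severity Y."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if severity < 0:
        raise ValueError("severity must be non-negative")
    return probability * severity


# Hypothetical events; severity is in arbitrary cost units.
frequent_minor = risk(probability=1e-3, severity=1e4)  # likely, but cheap
rare_severe = risk(probability=1e-6, severity=1e9)     # unlikely, but costly

print(f"Frequent minor event risk: {frequent_minor:.1f}")
print(f"Rare severe event risk:    {rare_severe:.1f}")
```

In this example the rare event carries about a hundred times the risk of the frequent one, even though its probability is a thousand times lower. That is the case Miller argues should be put in front of management in a 'go-no go' decision.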