Beyond the Widget: Columbia Accident Lessons Affirmed

*The author was a member of the Columbia Accident Investigation Board and would like to acknowledge the support and ideas contributed by many of its members and staff, particularly Maj Gen Ken Hess, Lt Col Rick Burgess, Lt Col Larry Butkus, Cdr Johnny Wolfe, and Dennis Jenkins.

SpaceRef note: this article originally appeared in Air & Space Power, Summer 2004.

On 1 February 2003, the world saw images that will remain forever seared in the memories of all who watched the space shuttle Columbia's final moments as it broke apart in the skies over Texas. As tragic as the Columbia accident was, multiple lessons to prevent future accidents can be "affirmed" from the circumstances surrounding this accident. The emphasis is on "affirmed," because all of those lessons had been previously learned during the past 40 years through the analysis of other tragedies:

  • April 1963, loss of the USS Thresher, while operating at the edge of several envelopes
  • January 1967, Apollo I capsule fire on launchpad
  • December 1984, Union Carbide pesticide factory tragedy in Bhopal, India, resulting from insufficient attention to maintenance and training, and its leadership ignoring internal audits
  • January 1986, loss of the space shuttle Challenger
  • April 1986, Chernobyl disaster, where safety procedures were ignored during reactor testing
  • July 2000, crash of a Concorde supersonic passenger plane in Paris after multiple prior incidents
  • September 2001, al-Qaeda attacks on the United States despite more than a decade of uncorrelated signals and warnings
  • October 2001, Enron collapse, despite multiple warnings and indications

The lessons gleaned from these and other prominent accidents and disasters, management and leadership primers, and raw experience are the same lessons that should have prevented the Columbia accident. The saddest part is that some in the National Aeronautics and Space Administration (NASA) had simply not absorbed, or had forgotten, these lessons; the result was the deaths of seven astronauts and two helicopter search team members, as well as the intense scrutiny of a formerly exalted agency.

This article highlights many of the major lessons affirmed by the Columbia Accident Investigation Board (CAIB) - lessons that senior leaders in other high-risk operations should consider to prevent similar mishaps and to promote healthy organizational environments. Admittedly NASA-specific and greatly condensed, the specific Columbia-related vignettes and perspectives presented here are intended to provide the reader an opportunity to step back and contemplate how his or her organization has the potential to fall into the same type of traps that ensnared NASA. Due to NASA's size, complexity, mission uniqueness, and geographically separated structure, some specific lessons may not be applicable to all organizations; however, the fundamental principles apply universally, as many of these same conditions may be present in any organization.

Effective leaders recognize that every organization must periodically review its operations to avoid falling into complacency as NASA had done. They also recognize that it is far better to prevent, rather than investigate, accidents. To assist with that prevention, readers should carefully examine the situations in which NASA found itself, perhaps drawing relevance by substituting their own organization's name for "NASA," and affirm those lessons once again. These situations are organized and examined in the three categories of basics, safety, and organizational self-examination.

We are what we repeatedly do. Excellence, then, is not an act, but a habit.

- Aristotle

Sticking to the Basics

The reason basics are called basics is that they form the foundation for an organization's success in every field from plumbing to accounting to technology-intensive space launches. As NASA and the world shockingly discovered, deviating from basics can form the foundation for disaster.

Keep Principles Principal

Avoid Compromising Principles. In the 1990s, the NASA top-down mantra became "Faster, Better, Cheaper." The coffee-bar chat around the organization quickly became, "Faster, Better, Cheaper? We can deliver two of the three - which two do you want?" While the intent of the mantra was to improve efficiency and effectiveness, the result was a decrease in resources from which the institution has yet to recover.

Leaders must contemplate the impact of their "vision" and its unforeseen consequences. Many must also decide whether operations should be primarily designed for efficiency or reliability. The organization and workforce must then be effectively structured to support that decision, each having a clear understanding of its role.

Leaders must remember that what they emphasize can change an organization's stated goals and objectives. If reliability and safety are preached as "organizational bumper stickers," but leaders constantly emphasize keeping on schedule and saving money, workers will soon realize what is deemed important and change accordingly. Such was the case with the shuttle program. NASA's entire human spaceflight component became focused on an arbitrary goal set for launching the final United States Node for the International Space Station. They were so focused, in fact, that a computer screen saver was distributed throughout NASA depicting a countdown clock with the months, days, hours, minutes, and seconds remaining until the launch of the Node - even though that date was more than a year away. This emphasis was not intended to change or alter practices, but in reality the launch-schedule goal drove a preoccupation with the steps needed to meet the schedule, resulting in an enormous amount of government and contractor schedule-driven overtime. This preoccupation clouded the institution's primary focus - was it to meet that date, or to follow the basic principles of taking all necessary precautions and ensuring that nothing was rushed?

Don't Migrate to Mediocrity. A glaring example of backing off of basics was in the foreign object damage (FOD) prevention program at Kennedy Space Center (KSC). KSC and its prime contractor agreed to devise an aberrant approach to their FOD prevention program, creating definitions not consistent with other NASA centers, Naval reactor programs, Department of Defense aviation, commercial aviation, or the National Aerospace FOD Prevention, Incorporated, guidelines. In the KSC approach, NASA implied there was a distinction between the by-products of maintenance operations, labeled processing debris, and FOD-causing foreign object debris. Such a distinction is dangerous to make since it is impossible to determine if any debris is truly benign. Consequently, this improper and nonstandard distinction resulted in a FOD prevention program that lacked credibility among KSC workers and one that allowed stray foreign objects to remain present throughout shuttle processing.

In devising a process that ignored basics, they created conditions that could lead to a disaster. Their new definitions ignored the reality that the danger generated by debris begins while the job is in progress. Although the contractor espoused a "clean as you go" policy, the elimination of debris discovered during processing was not considered critical, causing inconsistent adherence to that policy. Both contractor and KSC inspectors reported debris items left behind on numerous occasions. The laxity of this approach was underscored by the loss of accountability for 18 tools used in the processing of the Columbia orbiter for its doomed Space Transportation System (STS) mission STS-107. In the aviation world, the concern lies with foreign object ingestion into jet engines, interference with mechanical control mechanisms, and the like. If such items remain undetected aboard a shuttle, which is then launched into a microgravity environment, they create a great potential for harming shuttle systems or other objects in orbit - regardless of whether those items are classified as process or foreign object debris - their KSC-assigned terrestrial definitions. The assumption that all debris would be found before flight failed to underscore the destructive potential of FOD and created a mind-set of debris acceptance.

In another migration to mediocrity, NASA had retreated from its supposedly routine analysis of shuttle-ascent videos. After noting that foam from the external tank's left bipod ramp had struck the Columbia during its launch, NASA dismissed the danger in part by stating that this loss marked only the fifth time in 113 missions that foam had been lost, roughly a one-in-23 chance of occurrence. The CAIB, however, directed a full review of all existing shuttle-ascent videos, revealing two previously undiscovered foam losses from the left bipod ramp. Peeling the onion back even further, the CAIB evaluated how many missions actually produced usable images of the external tank during launch. Due to night launches, visibility, and external-tank orientation, images were available to document only 72 of the 113 missions. Thus, the failure to perform the "basic" and routine imagery analysis hid the actual severity of the problem; the seven left bipod ramp foam losses in 72 observed missions more than doubled the previously stated NASA odds, from one in 23 to one in 10. Had the film-analysis program been consistent over the history of the shuttle program, perhaps NASA would have detected and fixed the foam-loss problem sooner.
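The arithmetic behind those odds is simple enough to check directly. The following is a minimal sketch, not anything drawn from the CAIB's own analysis; the function name `one_in_n` is an illustrative choice, and the only inputs are the counts quoted in the paragraph above.

```python
def one_in_n(events: int, missions: int) -> float:
    """Express an observed rate as 'one in N' by dividing missions by events."""
    return missions / events


# NASA's stated figure: 5 known foam losses over all 113 missions flown.
reported = one_in_n(5, 113)

# CAIB's figure: 7 foam losses over the 72 missions with usable imagery.
observed = one_in_n(7, 72)

# How much higher the observed loss rate is than the reported one.
rate_ratio = (7 / 72) / (5 / 113)

print(f"reported: about one in {reported:.0f}")
print(f"observed: about one in {observed:.0f}")
print(f"observed rate is {rate_ratio:.2f} times the reported rate")
```

The point of the comparison is the denominator: counting only the missions that could actually have revealed a foam loss changes the rate far more than the two newly found events do on their own.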

Maintain Checks and Balances. A glaring example of where KSC faltered in its checks and balances lay in the administration of its government quality assurance (QA) program as maintenance changed to a contractor-run operation. Hardware inspections by government inspectors had been reduced from more than 40,000 per launch to just over 8,000. If properly managed, this level of inspection should suffice, as the contractor assumed more responsibility and had a strong program that relied heavily on the technicians' skill. However, that was not the case. For example, government QA inspectors were not permitted to perform some of the basics in their job descriptions - to include unscheduled "walk around surveillance." Indeed, one technician, having completed such surveillance, discovered a "Ground Test Only" (not-for-flight) component installed on an orbiter main engine mount prior to flight. Although his job description called for such inspections, that technician was threatened for working "out of his box." An attempt to confine such surveillance to statistically driven sampling parameters underscored a lack of experience and a lack of understanding of the purpose for such surveillance. It also served to handcuff the QA inspectors and the program's effectiveness.

While other examples exist, it suffices to say that checks and balances using "healthy tensions" are vital to establish and maintain system integrity in programs from the federal government to aviation. High-risk operations dictate the need for independent checks and balances. To further this approach, leaders must establish and maintain a culture where a commitment to pursue problems is expected - at all levels of the program and by all of its participants.

Mere precedent is a dangerous source of authority.

- Andrew Jackson

Avoid an Atrophy to Apathy. An organization should not invent clever ways of working around processes. For example, NASA created an ad hoc view of the anomalies it had experienced and then deemed subsequent anomalies as either "in family" or "out of family," depending on whether an anomaly had been previously observed. This led to "a family that grew and grew" - until it was out of control. This ad hoc view led to an apathy and acceptance of items such as the Challenger's solid rocket booster O-ring leakage and the foam strikes that had plagued the shuttle since its first mission but, until Columbia's demise, had never brought down an orbiter.

Control Configuration Control. The space shuttle is a magnificent system that has produced six orbiters - each differing from the others in multiple aspects. With only six orbiters, one might expect the use of an intricate method for tracking configuration changes of such things as wiring systems, control systems, and mounting hardware, likely augmented with extensive digital photos. That was not the case with the shuttle program, calling into question everything from the condition of orbiter components to the assumptions made on the shuttle's center-of-gravity calculations.

Leaders must insist on processes that retain a historical knowledge base for complex, legacy, and long-lived systems. Configuration waivers must be limited and based on a disciplined process that adheres to configuration control, updated requirements, and hardware fixes. If workers at the lower level observe senior leaders ignoring this path, routinely waiving requirements and making exceptions to well-thought-out standing rules, they too will join the culture of their seniors and begin accepting deviations at their level - adding significant risk to the overall system. Senior leaders must also ensure the steps required to alter or waive standing rules are clearly understood.

Avoid "Fads" - Question Their Applicability. Although bombarded by "management by objectives"; Deming-driven, off-site quality thrusts; and "one-minute-management" techniques, leaders must ensure that the latest "organizational fad" does not negatively influence their operations. For example, the ISO-9000/9001 sampling processes mandated in the NASA-United Space Alliance (USA) contract are based on solid principles and are appropriate for many processes and organizations worldwide. These principles would work well in a manufacturing process producing 10,000 bolts a day, or at a scheduled airline where a technician may perform the same steps dozens of times per week. However, the same principles do not necessarily apply in an environment where only three to six flights are flown each year, and workers may accomplish certain processes just as infrequently. Process verification must be augmented when critical operations take place with an "eyes-on, hands-on" approach, which was not happening in the shuttle program.

The KSC approach also had an emphasis on process over product; that emphasis was exemplified by employees, unaffectionately labeled Palm Nazis, who wandered the Orbiter Processing Facilities with personal digital assistant devices, sampling to verify that every step of a maintenance process was followed. This sampling approach certainly ensured the steps they checked were completed, which created a false sense of security with an equally false assumption - that verifying a process was followed would ensure that the product was perfect. Nothing could be further from the truth - the steps may have been insufficient, lacking required definition and depth, or improperly accomplished.

Keep Proper Focus. When launching a space shuttle or conducting any operations where safety is paramount, every operation should be unique; there is no such thing as routine. The CAIB discovered many within NASA had the attitude that putting a shuttle into orbit was just something that NASA did. In fact, the attitude should have been that "putting a shuttle into orbit is not something we do, it IS what we do." In testimony before the CAIB, Dr. Harry McDonald, who headed the 1999 Shuttle Independent Assessment Team, stated that NASA had drifted and expressed his conviction that NASA should return to its previous days of excellence and to a shuttle focus. He underscored his position by saying, "Each launch should be treated as the first launch, each orbit as the first orbit, and each reentry as the first reentry."1

When an organization adopts a mind-set that allows the most important thing that they do - their primary and most visible reason for existence - to become "just another operation," the focus of that portion of the organization is lost. An organization cannot let this happen, particularly when dealing with the safety of human lives or national assets such as the shuttle fleet. In an era of declining budgets, and with organizations looking for ways to compete for other business, organizations should avoid removing focus from the real goal or the "thing" that they are expected to do. When this happens, organizational focus can be lost and "what" they do simply becomes "something" they do.

A primary task in taking a company from good to great is to create a culture wherein people have a tremendous opportunity to be heard and, ultimately, for the truth to be heard.

- Jim Collins
Good to Great

Communicate, Communicate, Communicate

Leaders Must Insist on Discussion. NASA's heavy-handed management of meetings, using a rigid protocol, discouraged an open discussion of concerns, resulting in a failure to properly investigate those concerns. The senior executive service (SES) leaders at the meeting table did not seriously encourage inputs from the lower-ranking government service (GS) engineers on the room's periphery; however, it was these GS engineers who saw the potential danger from a foam strike to the Columbia.

Leaders not only must ask for inputs, but also must place a heavy emphasis on communication and encourage both consent and dissent. In fact, certain successful leaders of risky operations admit that they are uncomfortable if there are no dissenting opinions when important and far-reaching decisions are considered.

Encourage Minority Opinions. The minutes and audiotapes of NASA's Mission Management Team reflect little discussion other than that emanating from the head of the conference-room table. Expressions of concern that the foam impact might affect the integrity of the orbiter were quickly refocused to a discussion of how much additional maintenance might now be needed to prepare Columbia for its next flight.

Successful and highly reliable organizations promote and encourage the airing of minority opinions, such as those of the NASA engineers seeking to express their concerns with the foam strike. Leaders must acknowledge and exercise their own responsibility to conduct a thorough and critical examination, and remain cautious so as not to create an environment where they are perceived as ignoring inputs or no longer willing to hear about problems.

Leaders should listen and listen and listen. Only through listening can they find out what's really going on. If someone comes in to raise an issue with the leader and the leader does not allow the individual to state the full case and to get emotions out in the open, the leader is likely to understand only a piece of the story and the problem probably will not be solved.

- Maj Gen Perry M. Smith
Taking Charge

Conduct Effective Meetings - Transmit and Receive. Transcribed and audio evidence revealed that many NASA meetings were inconsistent and ineffective. For example, a critical meeting that was required to occur daily when a shuttle was on orbit was held only five times during the 16 days Columbia's astronauts were aloft. The heavy-handed management of many meetings limited input and discussion; voice tapes also revealed the tone of the leader's voice could be intimidating. The perfunctory, matter-of-course requests for inputs were often phrased more as a statement than as a solicitation, akin to, "Then there is no dissent on that point, is there." Period.

To be effective, meetings should have agendas and satisfy the requirements of a governing directive (such as their frequency when a shuttle is on orbit). An effective leader will elicit and listen to all opinions, evaluating carefully the possible substance. Leaders should promote respect for each participant's independence and value his or her contributions. An effective meeting leader will ensure each attendee the opportunity to contribute and not allow the person with the loudest voice to dominate the discussion. The leader should be inquisitive and ask questions about items that are not clearly presented, penetrating below the surface of glib marketing presentations - that emphasize the medium over the message, using fancy graphics, transitions, and so forth - and demanding backup data or facts. An effective meeting leader should encourage others to ask questions - knowing that if the leader doesn't understand something, the chances are others may have the same questions. Indeed, participants observing a leader who is comfortable enough to ask questions may be prompted to do the same.

During NASA meetings, a final problem occurred when an individual with expertise in one arena was incorrectly purported to have expertise in another critical arena. His forceful personality and familiarity with the meeting's leaders tended to quash dissent. It is the responsibility of the meeting leader to ensure its integrity - particularly during decision sessions - being aware of those with false or presumed expertise, and instead seeking out and listening to actual authorities with expertise.

Ensure Management-Information Systems Matter. As the CAIB discovered across NASA and its centers, older, legacy management information systems that did not interface with each other made problem identification very difficult. These systems had become dysfunctional databases - too burdensome to be effective, too difficult for average workers to use and interpret data, and, in the case of foam loss, simply nonexistent. Although the CAIB found that NASA tracked multiple metrics, the impression was that many of these simply served as "eyewash" or as one small piece of a huge pie that was irrelevant to or uncorrelated with the total picture. With multiple information centers, systems, and databases for trend analysis, senior leaders could not ensure appropriate metrics were tracked or, more importantly, that they were even used.

To avoid developing a focus on metrics for metrics' sake, the quantity being measured must be understandable, applicable, and measurable, and the goal must be attainable. Ideally, there should exist a process that consolidates and assimilates data from multiple databases, providing a comprehensive picture of system performance, costs, malfunctions, and other trends of utility to management.

Avoid "Organizational Arrogance." The CAIB conducted a review of more than 80 past reports of studies that related to the shuttle program and its management, focusing on the reports' findings, recommendations, and NASA's response.2 The revelations from that review were disturbing - NASA would essentially pick and choose the third-party inputs to which it would listen and respond. It made only incremental changes and then only to those things it saw fit to change, rarely letting such third-party concerns filter down to its line workers. NASA seemed to say, "We know what we're doing, so thanks for your input to human spaceflight." At a more grassroots level, evidence revealed that KSC decision makers had routinely ignored or shelved inputs from KSC line workers with 15 to 20 years of shuttle program experience. This often created dysfunctional situations - line workers redirected their ignored inputs to NASA Headquarters using the NASA Safety Reporting System (NSRS). The NSRS was established in 1987 after the Challenger shuttle mishap, and is an anonymous, voluntary, and responsive reporting channel to notify NASA's upper management of concerns about hazards. Those previously ignored concerns were often validated, mandating a headquarters-directed fix rather than one locally implemented and managed. More than a year after the Columbia accident, NASA had still not come to grips with ensuring experts' opinions were acknowledged - a March 2004 145-page report that included employee surveys reflected open communication was not the norm within NASA, that employees did not yet feel comfortable raising safety concerns to management, and that the raising of issues was still not welcomed.

Senior leaders must avoid insulating themselves (or even giving the perception of insulating themselves) from third-party inputs, workers, engineers, and operators - regardless of their position in or with the organization. Everyone's opinions deserve respect and should be given consideration. However, there must obviously be balance. The more hazardous the operation, particularly when lives are at risk, the more closely people will naturally examine every facet of it and the more reason for concern they will see. In NASA's case, for example, it is entirely conceivable that endless concerns could be raised through internal questions and outside reviews, allowing operations to be halted while every minute safety question is addressed in perpetuity. If this were allowed to become the norm, NASA might never again fly an aircraft test sortie, much less a shuttle mission. Thus, the key lies in accurately assessing and accepting calculated risks for such research and development systems - not reacting to every conceivable, abstract safety concern in a manner more appropriate to airliners that are expected to routinely and safely fly families and cargoes.

Be Thorough and Inquisitive

Avoid Leadership by PowerPoint. In our "sound bite" world, short and concise briefings have increasingly become the order of the day, especially the higher in the management echelon one resides. NASA management meetings were found to have greatly condensed briefings, sometimes boiling a 40-slide engineering analysis down to a single slide (the potential impact of a foam strike on the orbiter is one notable example). In other instances, the slide(s) presented would have factual errors or errors in assumptions that an expanded briefing or technical data may have eliminated (the case of the history of foam strikes and external tank modifications is one such example). Multiple examples of key NASA decision briefings were lacking in the rigor to explain or even identify key assumptions, ranges of error and variability, or alternative views.

Used properly, briefings and slides are certainly suitable tools for high-level summaries and decisions, but as a complement to and not a replacement for thorough analytical research and processes. Leaders must avoid using briefing slides as the sole means of transferring information during critical operations or for formal management decisions.

Leaders who have adopted a 10-slide briefing limitation for presentations may have done so because it was "how they were brought up," and it is their belief that it adds discipline and removes unnecessary information. However, they must also realize that they are not getting the full story - rather, they are getting a distilled view of what their subordinates have chosen to present to them. This practice could be acceptable if the decision maker were certain that a rigorous process had preceded the briefing - one that had thoroughly examined the issues and asked all the correct questions. That, however, was not the case with NASA. In some instances, the necessary data had been cast aside, or, worse, not even sought. Competent leaders realize that they are accountable for the results of the actions of their organization and realize that if there's any doubt, they must insist on getting enough information (even the complete story) to convince themselves of the integrity of their processes.

Mandate Missouri Mind-Sets ("Show Me!"). A healthy pessimism is required in high-risk operations. During prelaunch operations, NASA seemed to demonstrate a healthy pessimism, questioning deficiencies that could affect the mission and exhibiting an attitude of "prove to me this is safe." However, after launch, that attitude seemed to be recast to "prove to me it's unsafe," meaning that if the engineers and managers did not produce solid evidence to support their concerns, those concerns were quickly subordinated to mission accomplishment.

Disregarding engineers' concerns also subdued a healthy curiosity. Although the external tank was known to shed large chunks of foam, the postlaunch, debris-strike damage assessment done for Columbia while it was on orbit relied on test data and analytical models for relatively minuscule foam projectiles. However, "what if large pieces of foam hit the orbiter?" was a question no one had been motivated to ask or answer - not after the first loss of a large piece of foam on STS-7, and not after the loss of a much larger piece during STS-112's October 2002 ascent that hit and damaged the solid-rocket booster. As a result, no viable analytical models had been developed or test data collected for large foam-debris strikes.

After the Columbia tragedy, NASA was originally entrapped into believing and even evangelizing that foam could not hurt the orbiter. One reason was that NASA became enamored with an "analysis by analogy," publicly stating that a foam strike was akin to the Styrofoam lid on a cooler in the bed of a pickup that is traveling on the road ahead of you suddenly flying off, striking your car, and harmlessly breaking apart. Although making superficial sense, it was an approach proven dramatically and tragically faulty. As an analogy, it ignored the basic physics - kinetic energy (KE) of the foam ($KE = \frac{1}{2}mv^2$) - of a 1.67-pound piece of foam breaking off a rocket body traveling at nearly Mach 2.5 and decelerating to a differential speed of approximately 500 mph before encountering Columbia's wing. Indeed, there were those who were not convinced until 7 July 2003, when a test replicated those conditions.

In that "show me" test, the engineers at the Southwest Research Institute fired a 1.67-pound piece of foam at 500 mph, shattering a hole in an orbiter's wing panel. In short, a preference for a clever analogy can serve as a recipe for repeating catastrophic mistakes, whereas insistence on analysis over analogy can prevent potentially disastrous situations.
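The kinetic-energy point can be made concrete with a few lines of arithmetic. What follows is a minimal sketch under stated assumptions: the mass (1.67 pounds) and speed (500 mph) are the figures quoted above, while the function name and the 60 mph comparison speed are hypothetical choices made only to illustrate how energy scales with the square of velocity. Under those inputs the foam carries on the order of 19 kilojoules.

```python
LB_TO_KG = 0.45359237   # exact definition of the avoirdupois pound in kilograms
MPH_TO_MS = 0.44704     # exact: 1609.344 metres per mile / 3600 seconds per hour


def kinetic_energy_joules(mass_lb: float, speed_mph: float) -> float:
    """Return KE = 1/2 * m * v^2 in joules for a mass in pounds and speed in mph."""
    mass_kg = mass_lb * LB_TO_KG
    speed_ms = speed_mph * MPH_TO_MS
    return 0.5 * mass_kg * speed_ms ** 2


# Figures quoted in the article: a 1.67-pound piece of foam at about 500 mph.
foam_ke = kinetic_energy_joules(1.67, 500)

# Hypothetical comparison (an assumption, not from the article): the same
# mass striking at 60 mph, a generous figure for the cooler-lid analogy.
slow_ke = kinetic_energy_joules(1.67, 60)

print(f"foam at 500 mph: {foam_ke / 1000:.1f} kJ")
print(f"same mass at 60 mph: {slow_ke / 1000:.2f} kJ")
print(f"energy ratio: about {foam_ke / slow_ke:.0f} to 1")
```

Because velocity is squared, the mass of the object matters far less than its closing speed, which is exactly what the cooler-lid analogy left out.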

Can't get an "ought" out of an "is."

- G. E. Moore

Question Untested Assumptions. Leaders must be careful not to rely on untested basic system certification as the "end-all solution" to approaching problems. Originally, the space shuttle's leading-edge, reinforced carbon-carbon (RCC) panels were arbitrarily certified for 100 missions; therefore, there was no perceived integrity problem due to the aging of the panels. While engineering and design criteria were exhaustively incorporated into the shuttle, no similar system existed to revalidate and recertify the RCC design assumptions or to check the progression of unforeseen problems, such as micrometeoroid strikes, pinholes, corrosion, oxidation, and other effects detrimental to those critical leading-edge RCC panels.

In another example of untested components, the faulty external tank foam had never been dissected - especially the foam applied in the bipod ramp area that came apart and hit Columbia's leading edge. The dissection of several different bipod ramps, accomplished at the direction of the CAIB, revealed voids, gaps, and even debris - any one of which could have contributed to the bipod-ramp foam losses that occurred roughly once in every 10 missions. However, NASA had never pursued evaluating the foam beyond simple pull tests to check adherence to the external tank, eddy current tests to verify the foam thickness, and chemical composition checks.

To ensure it employs technology over technique, an organization must, if possible, certify all critical hardware through testing - not just analysis. However, if analysis must be used, it should be verified by testing. For example, even today's computerized aircraft-design process does not eliminate the necessity for flight-testing. Using certified test techniques to inspect critical hardware during maintenance turnaround, and upgrading those techniques as new test technologies emerge, should be standard procedure. Examples were found of NASA failing to use modern technology to facilitate its testing. CAIB members were astonished to find 1960s- and 1970s-era test equipment while visiting NASA work centers. Although it might still work for its original purpose, today's digital equipment offers a more accurate, maintainable, reliable, and economical methodology.

Ensure Taskings and Resources Balance. Leaders must be willing to stand up and say "No" when tasked to operate or function without sufficient resources, risking their own careers, if necessary. Perhaps former shuttle program managers and center directors should have resigned in protest years ago for being unable to safely support the shuttle and International Space Station (ISS) programs with congressionally approved budgets, personnel, and resources. When leaders become convinced, using objective measures, that their taskings and resources are out of balance, it is their duty to make their concerns known, act appropriately on their convictions, and ensure those concerns are consciously addressed. Such objective measures are critical, however, for NASA has shown - as recently documented in an April 2004 General Accounting Office report to Congress - that it could not provide detailed support for the amounts it had obligated in its budget requests.

Safety First - and Always

Discovering these vulnerabilities and making them visible to the organization is crucial if we are to anticipate future failures and institute change to head them off.

- D. D. Woods and R. I. Cook
Nine Steps to Move Forward from Error

Illuminate Blind Spots

A key to safe operations is to eliminate all potential blind spots - areas that are not seen or subject to examination and from which unforeseen problems might arise. Their danger is that they are invisible until identified by someone with a different vantage point or opinion. NASA allowed itself to evolve into an organization with inconsistent authority and responsibility in its safety structure, exhibiting marked differences between and even within its centers. Along the way, it had also transferred some of this inherent safety responsibility to contractors - creating governmental blind spots.

Leaders must always be on the lookout for these weaknesses and other safety shortfalls. It is imperative to have a safety organization, or similar office, whose goal is to search out and identify blind spots - those potential problem areas that could become catastrophic.

You need an established system for ongoing checks designed to spot expected as well as unexpected safety problems. . . . Non-HROs [non-high-reliability organizations] reject early warning signs of quality degradation.

- Karlene H. Roberts
High Reliability Organizations

Stop Stop-Gap Safety

While NASA can boast some of the most effective industrial safety programs in the world - the industrial safety world of "trips and falls, hard hats, and safety goggles" - its effectiveness in institutional safety (programs and processes) was found lacking. Waivers that even experienced astronauts found startling had become the order of the day and were accepted as a matter of course. Columbia, for example, was flying STS-107 with 3,233 waivers - 36 percent of which had not been reviewed in 10 years. The number of waivers remained a sore spot with technicians and some engineers, but this had become an accepted practice by management. No one knew the extent of the waivers, how one waiver might contraindicate others, or how certain combinations might have a cumulative failure potential. Safety personnel silently observed, if they noticed at all.

An involved and independent safety structure is vital, especially in high-risk organizations like NASA. Safety managers must have an equal voice in decision making, the authority to stop operations, the ability to question waivers and similar items, and direct access to key decision makers. Further, employees and contractors at all levels must never feel threatened when bringing "bad-news" safety issues to their bosses. Overconfidence in organizational safety processes must be avoided since unexpected events require "out-of-the-box" solutions - solutions that generally come from workers in the trenches and not senior management.

Leaders of high-risk organizations must ensure that key program leaders do not unilaterally waive operational or technical requirements, a problem illustrated by NASA's excessive number of waivers. Clearly defined technical requirements and thorough and independent processes for safety and verification can do much to achieve this objective. Such an approach can be bolstered if leaders ensure risk-assessment capabilities are quantitatively based, centralized, and given programwide access for simpler, organizationwide hazard assessments.

Additionally, in complex organizations dealing with high-risk technologies, there must be clarity, uniformity, and consistency of safety responsibilities. Tailoring by program managers or directors should never be permitted unless approval is granted by both the organization having final authority for technical requirements and by the organization having oversight of compliance.

Put Safety First - Safety People, Too

NASA seemed unconcerned about staffing some of its centers' key safety organizations with the right people, and also relegated those activities to back shops that had a minor supporting role and limited authority. This practice must change to ensure a viable first line of defense - safety organizations must be empowered, and safety personnel certainly cannot be treated as second-class citizens in the eyes of others or themselves. Unless this advice is followed, in-line safety organizations will not be the first line of defense they are expected to be.

Keep It Pertinent - and Attractive

Results show that organizations should spend a significant amount of energy on safety awareness - and not just on simple posters, bumper stickers, and doodads. The Navy, for example, has done an admirable job of producing lesson-packed but entertaining articles that appear after every serious accident. These articles allow all sailors to learn from the mishap; indeed, many are enticed to learn through the presentation of the material. Organizations should be committed to the communication of safety lessons, and those that follow such an approach will help their members stay a step ahead in safety awareness.

Third-Party Review Caveats

Be Alert for "Pet Pigs." One of NASA's previous approaches to safety was to form a focus group, relegating safety to the back row of key decision-making meetings. Formed in the wake of the 1967 Apollo 1 fire, the Aerospace Safety Advisory Panel (ASAP) was a solid concept, but it had no authority. The panel was designed to spot neither the smaller, regularly occurring events that happened on the shop floor every day, nor the larger, looming deficiencies waiting to strike.

The ASAP got into a vicious circle with NASA. Its members used the tenure of their position to focus on their "pet pigs," the aspects of the program with which they had familiarity or which were on the members' personal agenda. NASA, in turn, grew to ignore the ASAP, considering nearly everything it did as simply a championing of their pet pigs versus providing safety insights with operational value to NASA. The ASAP and other such panels NASA chartered were rendered ineffective.

The lessons to remember from NASA's experiences are to ensure that the charters of future safety organizations are clear, the qualifications for membership are appropriate for the task, and they have the authority to act. Operations requiring high levels of safety and mission assurance should have full-time safety engineers involved - people or teams who understand systems theory and systems safety. Simply forming another group and naming high-profile members, or getting one more outside assessment, will neither identify systemic safety problems nor cause senior leaders to change the way they do business.

Routinely Review. Successful organizations must have a review process that addresses the findings and recommendations from third-party reviews and then tracks how that organization addresses those findings. As previously discussed, NASA's response to such reviews was, at best, sporadic. That was, in part, because of a mind-set that had grown from their experience with the ASAP - a vicious circle of ignoring pet pigs. However, if a disciplined review process existed to evaluate such inputs, a record would exist to document how review findings were resolved or, perhaps, why they were justifiably ignored.

Err on the side of providing too much rather than too little information in the aftermath of a mistake or failure.

- James M. Strock
Reagan on Leadership

Go "Beyond the Widget"

Rarely is there a mishap caused by a single event or a broken widget. Therefore, after major mishaps - such as aviation and naval accidents - senior leaders must use that opportunity to look at the "whole" organization. Even if the apparent cause of a flight accident is a broken part or an obvious pilot error, there are usually several other contributing factors. Those factors range from design and manufacturing processes to crew training deficiencies and operational employment. For Columbia, the CAIB did not simply conclude that "the foam did it." The CAIB examined NASA's entire organizational and safety structure and found that to be as much at fault as the foam-shedding event. By going beyond the widget, the CAIB in effect said, "The foam did it. . . . The institution allowed it."

Make Benchmarking Bedrock

Leaders of large organizations should consider cross-organizational benchmarking to learn how other like agencies or services implement operational safety into their operations. Benchmarking should also include sharing techniques and procedures for investigating mishaps, with the objective of applying lessons learned toward mishap prevention. For example, spacecraft, aircraft, and submarines have sealed pressure vessels that operate in hazardous environments. Each system requires the integration of complex and dangerous systems, and they all must maintain the highest levels of safety and reliability to perform their nationally significant missions. Each community has something to learn from the others.

Over the years, these organizations [HROs] have learned that there are particular kinds of error, often quite minor, that can escalate rapidly into major, system-threatening failures.

- James T. Reason
Managing the Risks of Organizational Accidents

Track Flaws through Closure

The KSC's discrepancy tracking system was a glaring example of a failure to track flaws. KSC had moved away from a previously effective closed-loop tracking system. In that system, an inspector or engineer who observed a failure or problem documented the discrepancy. The problem was then verified with appropriate analysis. The root cause was established, and the appropriate corrective action was determined and incorporated. Finally, the inspector or engineer who had originally discovered the problem evaluated the effectiveness of the corrective action. This ensured the proper disposition of the discrepancy, as well as ensuring that the "fix" was shared with others working on the same or similar systems (in the case of the orbiter at the KSC, the "fix" information would be shared with the personnel at the Orbiter Processing Facilities, the Vehicle Assembly Building, and the launchpad). With these closed-loop and information-sharing processes eliminated, there no longer existed a path to ensure discrepancies were properly resolved or a method to ensure that all who needed to know about the discrepancy were actually informed. The elimination of those processes created the potential for repeat problems.

Organizations must take discrepancy tracking seriously and view inspections as valuable - especially since they can identify deficiencies, force positive change, and make improvements. Inspections may also spur findings and recommendations, and leaders must ensure the organization is responsive to those findings and recommendations within the specified period.

Organizational Self-Examination

It's extremely important to see the smoke before the barn burns down.

- Bill Creech
The Five Pillars of TQM

A major strength of organizations that successfully deal with high-risk operations is their ability to critically self-evaluate problems as they are identified. Reporting good news is easy and often useful. However, the reporting of bad news is critical and should be encouraged, and it must be accompanied by a discussion of what will be done about it. The culture within these successful organizations recognizes that simply reporting bad news does not relieve the individual or department of the responsibility to fix it.

Teaming

Develop the Team. As large as NASA is and as unified as the shuttle-related workforce is behind each mission, it had not developed an institutionalized program to identify and nurture a stable of thoroughbreds from which to develop its future senior leadership. As a result, much of NASA's managerial hierarchy, from GS-14 to associate administrator levels, had assumed their positions without having received a prescribed standard of education, career-broadening, leadership experience, or managerial training that collectively would prepare them for their roles of ever-increasing responsibility. In short, NASA found itself with some relatively junior "stars" thrust into positions of immense responsibility for which they were unprepared.

Leaders and organizations that emphasize people over and above organizational processes or products will be able to recruit and retain the very best people - people who will be trained, developed, and rewarded during their careers in the organization. This philosophy not only produces positive results in those directly affected, but also positively influences their coworkers and subordinates who can see, early in their careers, the potential for education and career-broadening opportunities in their future. Organizational leaders should consider executive development programs, such as those followed in the Air Force, to provide professional development and "sabbaticals" at appropriate career phase points.

We train together . . . we fight together . . . we win together.

- Gen Colin Powell

Train for Worst-Case Scenarios. The CAIB found NASA ill prepared for worst-case scenarios. Indeed, evidence revealed that NASA's complacency caused it not to pursue worst-case events or practice using the scenarios those events would generate. For example, despite the tragic Challenger launch accident, NASA still routinely aimed its launch-anomaly practice at emergencies, such as losing a main engine, that resulted in the shuttle not being able to achieve orbit and having to land at an emergency recovery field on the far side of the Atlantic. While this is indeed a serious scenario, the prior failure to pursue and practice orbiter integrity problems, with their potential crew-loss implications, proved to be a continuing blind spot, resulting in the failure to request imagery that could have revealed Columbia's damage from the foam impact.

Safety analyses should evaluate unlikely, worst-case, event-failure scenarios, and then training events should be developed and scheduled, simulating potential catastrophic events. Senior leaders must lead these worst-case training and failure scenarios, which produce an experience base similar to that gained by aircrews during intensive simulator sessions or via Red Flag exercise scenarios. They will develop the ability to make critical decisions during time-sensitive crises, using the experience gained from worst-case exercises. Such an approach to the worst-case scenario will force decision makers to resolve problems using tested and fail-safe processes, thus reducing the chance they could break down in the "fog of war" or during the stress of real-time malfunctions, anomalies, or events.

Those who ignore the past are condemned to repeat it.

- George Santayana

Educate Past Hiccups. Since 1996, over 5,000 Naval Nuclear Propulsion Program members have been educated in lessons learned from the Challenger accident, primarily through the lessons documented in Diane Vaughan's The Challenger Launch Decision.3 NASA, however, seemed to continue to assert its organizational arrogance with a "we know what we're doing" attitude. NASA did not train on the landmark Challenger lessons and never invited Ms. Vaughan to address any of its gatherings.

Senior leaders must ensure that their organization's key members are fully educated on past mistakes, with a focus on lessons learned. That is especially important when its own organizational structure has been at fault in those mistakes. Large, high-risk organizations that act as though they are in denial risk repeating past mishaps. A successful organization must remain a "learning organization," internalizing the lessons from big and small mistakes and continuously improving.

In the Air Force's SR-71 program, for example, past incidents and accidents were studied by all new crew members in an elemental block of instruction. During that block, the crews would review every reportable incident that had occurred during that specialized program's existence - beginning with its first operational sortie in 1968. This program continued through the SR-71's retirement in the 1990s, and contributed to its remarkably strong safety record - a considerable accomplishment for such a unique aircraft, the only one capable of operating in its hostile and unforgiving environment.

Avoid Promoting Unintended Conflicts. The requirement to support the International Space Station had an indirect and detrimental influence on mission preparation for Columbia and STS-107, its final mission. Just as these external factors altered the organizational goals and objectives for Columbia, other factors will affect future operations if management does not recognize those pressures and consciously take measures to counter their influence. The external factors of cost and schedule pressures, for example, can have a negative influence on safety and reliability. Leaders must ensure that their support of other programs and management tools is not allowed to cause "unintended consequences" which may force subordinate operators and leaders to make questionable decisions.

In discussing such organizations [HROs], it's emphasized that, "The people in these organizations . . . are driven to use a proactive, preventive decision making strategy. Analysis and search come before as well as after errors . . . [and] encourage:

  • Initiative to identify flaws in SOPs [standard operating procedures] and nominate and validate changes in those that prove to be inadequate
  • Error avoidance without stifling initiative or (creating) operator rigidity"

- T. R. LaPorte and P. M. Consolini
James Reason, Managing the Risks of Organizational Accidents

    Seek to "Connect the Dots." Within NASA, the machine was talking, and no one was listening - neither program management nor maintenance process owners recognized the early warning signs of their defects. For example, the tile damage caused by foam impact was a maintenance problem that repeated itself on every flight. However, maintenance process owners did not present that information as a preventable problem above midlevel personnel. More often, the emphasis was on how to repair and improve an orbiter's tile adhesion and resiliency versus finding the sources of the tile's damage - the lack of the external tank's foam adhesion.

    Although these errors can occur in any large organization, successful organizations are sensitive to "weak signals" and make improvements by investigating and acting on the occasional small indicator. These organizations must be sensitive enough to learn from - and not overlook - "small" incidents; their members must be encouraged to highlight such incidents. Leaders cannot wait until a major catastrophe occurs to fix internal operations issues or safety shortfalls.

    Sustain Sustainment. Although the shuttle's planned service life was stretched from the original program and design goal of 100 flights in 10 years to 2006, then 2012, then 2020, no viable sustainment plan was built.

    Should a need arise to continue to operate a system beyond its initially designed service life, as happened with the shuttle program, an extended lifetime must be carefully calculated, and plans must be developed and executed to permit the system to reach its new service life safely. Initial program planning must include sustainment mechanisms for the duration of its planned existence; those mechanisms must be modifiable and then adjusted to properly sustain the program when the life of that program is extended. Air Force system sustainment and service life extension programs (SLEP), for example, provide a benchmark for the level of excellence other organizations (including NASA) could emulate. Treating lifelong sustainment as a goal equal to or more important than the original certification keeps the Air Force a step ahead by strongly encouraging the design of systems with maintenance in mind and the building of data and processes that monitor the fleet's health. Such an approach attempts to anticipate the need and then adjust the sustainment measures to reflect the unavoidable, changing environment that accompanies aging products.

    Except in poker, bridge, and similar play-period activities, don't con anyone - especially yourself.

    - Robert Townsend
    Further Up the Organization

    Don't Confuse Tomorrow's Dream with Today's Reality. NASA allowed the shuttle to effectively transition from a research and development system to operational status, despite the fact that prior to the Columbia tragedy there had only been 111 successful shuttle flights. In contrast, the Air Force's F/A-22 is programmed for 2,500 flights, nearly 4,600 test hours, before being deemed operational. Although the space shuttle should be considered experimental because of the nature of its mission profiles, it was, due to its commitments and ISS obligations, processed and operated as an operational vehicle.

    Senior leaders must ensure that a vehicle or program still in the R&D stage is not treated as operational and fielded - an experimental vehicle or program must be treated as such. Although the loss of Columbia cannot be directly tied to the confusion between R&D and operational, it did influence certain decisions that may have changed the fate of the crew; a decision not to pursue imagery eliminated the consideration of an on-orbit repair or rescue mission.

    Outsourcing Caveats

    Retain and Exercise Accountability. In many ways, NASA is a victim of the same government financial reform initiatives that many organizations face. For example, turning work over to a contractor and then reducing the size of the government staff charged with monitoring the contractor is not unique to NASA.

    Often, government reform initiatives can blur the lines of accountability or even violate Federal Acquisition Regulations - they certainly did within NASA. Although the government's responsibility and authority roles were diminished in the shuttle program, the accountability role clearly should not have been - just as it should not diminish in any organization.

    Contracting Caution: Expertise Loss Ahead. The United Space Alliance's Space Flight Operations Contract (SFOC) with NASA and the resulting loss of technical expertise within NASA are good examples of diminishing government expertise. In NASA, senior management often evolved to the point of being uninformed when compared to the expertise of its prime contractor, United Space Alliance, and the prime's subcontractors.

    Leaders must ensure that appropriate organizational expertise is retained as processes and programs are contracted out. If not, the organization itself will wilt; it will merely have individuals overseeing contracts and matters in which they have very little technical expertise. When considering organization and contractor interface, the question becomes, "How much technical expertise should reside with the contractors on an operational system?" If contractors are given too much independence, over time, they may begin to drive new requirements - something that should be done only by the owning organization. Successful organizations cannot afford to lose their corporate knowledge and must avoid the easy and economically tempting solution of privatizing technical expertise. Finally, just as warriors must understand their commander's intent, contract structures must ensure that organizational goals are fully understood and met by the people who have been contracted to carry them out. Unfortunately for the shuttle, incentives were weighted more toward launching shuttles and meeting interim schedule milestones than correcting problems, which had significant safety implications.

    Outlaw Normalization of Deviance. The space shuttle travels through arguably the most hostile environment on or above Earth - and NASA made it look easy. However, in clear violation of written design specifications, foam and debris were falling off and hitting the orbiter during its launches. Nevertheless, as more and more flights landed successfully, the perceived danger from debris and foam strikes continued to diminish. Successful flights, despite failing to satisfy the design requirements that prohibited foam strikes, serve as examples of how success can set an organization up for future failure. When such unplanned-for occurrences are ignored or left unresolved, or when shortcut fixes are accepted today, the consequences may be catastrophic tomorrow. As this tragedy underscored, past successes - or lack of failures - helped create and expand blind spots, bureaucratic complacency, and "group think" when approaching anomalies such as debris strikes.

    Due to the normalization of prior shuttle debris events, when foam was seen striking Columbia on STS-107, senior leaders and decision makers were already convinced that foam could not bring down an orbiter, and viewed this as nothing more than a maintenance turnaround issue. By letting "the unexpected become the expected that became the accepted," NASA had achieved what Diane Vaughan termed the normalization of deviance.4

    Uncorrected minor and seemingly insignificant deviations from accepted norms and technical requirements can lead to catastrophic failure - an unacceptable and often predictable consequence of normalizing deviance. Leaders must maintain a constant vigilance to avoid complacency and acceptance of anomalies, regardless of how risky the technology may be.

    A Closing Thought

    A total of 16 people - two space-shuttle crews and two helicopter-crew members - perished because NASA failed to go "beyond the widget." If NASA will now absorb the hard lessons from this tragedy, it can remove the conditions that make it ripe for another disaster. Likewise, any organization not abiding by the lessons to be learned from this tragedy may be creating its own recipe for disaster, for these cancerous conditions may be present in any organization.

    These lessons, affirmed by Columbia's loss, are summarized in the 20 primary questions below - questions all organizations should periodically ask of themselves to prevent complacency and forgo the potential calamities complacency could facilitate. As you review these questions, you might consider, "The foam did it. . . . The institution allowed it." The questions to ask yourself are, "What foam do you have . . . and what are you allowing?"

    An Organizational Self-Examination Checklist

    Basics

    1. Do you "keep principles principal"?
      • Avoid compromising principles?
      • Avoid clouding principles?
      • Avoid migrating to mediocrity?
      • Maintain checks and balances?
      • Avoid an atrophy to apathy?
      • Control "configuration control"?
      • Avoid "fads"? Question their applicability?
      • Keep proper focus?
    2. Do you communicate, communicate, and communicate?
      • Insist on discussion?
      • Encourage minority opinions?
      • Conduct effective meetings?
    3. Do you affirm that management information systems matter?
    4. Do you avoid "organizational arrogance"?
    5. Do you remain thorough and inquisitive?
      • Avoid leadership by PowerPoint?
      • Mandate "Missouri show-me mind-sets"?
      • Question untested assumptions?
    6. Do you ensure taskings and resources balance?

    Safety

    7. Do you stop stopgap safety?
    8. Is safety first . . . safety people, too?
    9. Are you keeping safety pertinent - and attractive?
    10. Are you aware of third-party review caveats?
      • Watching for "pet pigs"?
      • Routinely reviewing inputs?
    11. Do you go "beyond the widget"?
    12. Is benchmarking bedrock?
    13. Are you tracking flaws through closure?

    Organizational Examination

    14. Are you promoting teaming?
      • Developing the team?
      • Training for worst-case scenarios?
      • Educating past hiccups - others' and your own?
    15. Do you avoid promoting unintended conflicts?
    16. Do you seek and attack signals to "connect the dots"?
    17. Are you sustaining sustainment?
    18. Does tomorrow's dream distort today's reality?
    19. Are you aware of outsourcing caveats?
      • Outsourcing accountability?
      • Outsourcing expertise?
    20. Are you outlawing "normalization of deviance"?

    Notes

    1. Testimony of Harry McDonald bef