Reliability Centered Maintenance - The Seven Questions Methodology
Reliability Centered Maintenance (RCM) is a structured decision-making process used to determine what must be done to ensure physical assets continue to do what their users require in their present operating context.
RCM2 was developed by John Moubray in the late 1980s, adapting the original aviation RCM methodology (Nowlan & Heap, 1978) for industrial applications. It is one of the most widely used forms of RCM and complies with SAE JA1011.
"A process used to determine what must be done to ensure that any physical asset continues to do whatever its users want it to do in its present operating context."
RCM focuses on preserving system FUNCTION, not just preventing equipment failure. The consequences of failure drive maintenance decisions, not the failure itself.
Identify and protect the functions that matter most to operations, safety, and environmental compliance.
Determine all the ways each function can fail and the causes (failure modes) that are reasonably likely to occur.
Classify failures by their consequences: hidden, safety/environmental, operational, or non-operational.
Choose maintenance tasks that are technically feasible AND worth doing based on consequences.
Achieve the right balance of proactive and reactive maintenance at minimum total cost.
1968: Maintenance Steering Group (MSG) develops a systematic maintenance program for the Boeing 747
1978: Nowlan and Heap of United Airlines publish "Reliability-Centered Maintenance" for the US Department of Defense
Late 1980s: John Moubray adapts RCM for industrial applications and creates the RCM2 methodology
1999: SAE publishes JA1011, the standard defining minimum criteria for RCM processes
RCM2 used globally across aviation, military, power generation, manufacturing, and more
RCM2 answers these seven questions in order for each asset in its operating context. The first four questions form the FMEA (Information Worksheet), while questions 5-7 drive the Decision Diagram.
Example (Q1, function): "To pump water from Tank A to Tank B at not less than 800 liters per minute"
Example (Q2, functional failures): "Unable to pump any water" or "Pumps less than 800 L/min"
Examples (Q3, failure modes): Impeller worn, Seal failure, Bearing seizure, Motor winding failure, Coupling broken
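The pump example can be captured as a small Information Worksheet record. The following is a minimal sketch only: the class names, the `min_flow_lpm` field, and the grouping of failure modes under each functional failure are assumptions made for illustration and are not prescribed by RCM2.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FailureMode:
    cause: str        # Q3: what causes the functional failure
    effect: str = ""  # Q4: what happens when it occurs (filled in later)


@dataclass
class FunctionalFailure:
    description: str  # Q2: how the asset fails to fulfil its function
    modes: List[FailureMode] = field(default_factory=list)


@dataclass
class Function:
    statement: str       # Q1: function including its performance standard
    min_flow_lpm: float  # performance standard from the pump example
    failures: List[FunctionalFailure] = field(default_factory=list)

    def functional_failure(self, measured_flow_lpm: float) -> Optional[str]:
        """Return the functional failure a flow reading represents, if any."""
        if measured_flow_lpm <= 0:
            return "Unable to pump any water"
        if measured_flow_lpm < self.min_flow_lpm:
            return f"Pumps less than {self.min_flow_lpm:g} L/min"
        return None


pump = Function(
    statement="To pump water from Tank A to Tank B at not less than 800 liters per minute",
    min_flow_lpm=800.0,
    failures=[
        # The assignment of modes to failures below is illustrative only.
        FunctionalFailure(
            "Unable to pump any water",
            [FailureMode("Bearing seizure"), FailureMode("Motor winding failure"),
             FailureMode("Coupling broken")],
        ),
        FunctionalFailure(
            "Pumps less than 800 L/min",
            [FailureMode("Impeller worn"), FailureMode("Seal failure")],
        ),
    ],
)
```

Stating the performance standard as a number is what makes the second functional failure checkable: a reading of 650 L/min is a failure even though the pump is still running.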
A task is only selected if it is technically feasible AND worth doing based on consequences.
Important: Run-to-failure is NEVER acceptable if failure has safety or environmental consequences!
Research by Nowlan & Heap (1978) revealed that equipment doesn't always fail in predictable, age-related patterns. These six patterns show the conditional probability of failure over time.
Pattern A (Bathtub): High infant mortality, low random, then wear-out zone
Pattern B (Wear-Out): Low random failure, then increasing probability at end of life
Pattern C (Fatigue): Gradually increasing probability of failure, no distinct wear-out
Pattern D (Break-In): Low initially, rapid rise to constant random level
Pattern E (Random): Constant probability of failure at any age
Pattern F (Infant Mortality): High initial failure, then low random probability
In the UAL study, only 11% of failures were age-related (Patterns A, B, C). The remaining 89% were random or had infant mortality characteristics. Time-based maintenance alone cannot prevent most failures!
Most failures attributed to Pattern F are caused by human error during installation, maintenance, or operation. Proper training, procedures, and precision practices can significantly reduce these failures.
| Pattern | Type | UAL Study (1978) | Broberg (1973) | SUBMEPP (1993) | Effective Maintenance |
|---|---|---|---|---|---|
| A - Bathtub | Age-Related | 4% | 3% | 6% | Scheduled discard/restoration after burn-in |
| B - Wear-Out | Age-Related | 2% | 1% | 17% | Scheduled discard before wear-out point |
| C - Fatigue | Age-Related | 5% | 4% | 3% | Condition monitoring for degradation |
| Total Age-Related | | 11% | 8% | 26% | |
| D - Break-In | Random | 7% | 11% | 6% | Condition monitoring |
| E - Random | Random | 14% | 15% | 42% | Condition monitoring, redundancy |
| F - Infant Mortality | Random | 68% | 66% | 29% | Precision maintenance, proper procedures |
| Total Random | | 89% | 92% | 77% | |
UAL = United Airlines (aviation), Broberg = Swedish study, SUBMEPP = US Navy Submarine Maintenance
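The subtotals in the table can be recomputed from the per-pattern figures. The sketch below copies the percentages as they appear in the table and sums them; the dictionary layout and function name are illustrative. As transcribed, the SUBMEPP column sums to 103% rather than 100%, so one of its entries may be misreported, and the sketch reports the column total to make that visible.

```python
# Percentage of failures per pattern, copied from the table above.
STUDIES = {
    "UAL (1978)":     {"A": 4, "B": 2,  "C": 5, "D": 7,  "E": 14, "F": 68},
    "Broberg (1973)": {"A": 3, "B": 1,  "C": 4, "D": 11, "E": 15, "F": 66},
    "SUBMEPP (1993)": {"A": 6, "B": 17, "C": 3, "D": 6,  "E": 42, "F": 29},
}

AGE_RELATED = ("A", "B", "C")  # patterns with an age-related failure probability


def summarize(shares):
    """Return age-related, random, and overall totals for one study."""
    age = sum(v for p, v in shares.items() if p in AGE_RELATED)
    rand = sum(v for p, v in shares.items() if p not in AGE_RELATED)
    return {"age_related": age, "random": rand, "total": age + rand}


for name, shares in STUDIES.items():
    print(name, summarize(shares))
```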
RCM2 first asks whether failure is evident to operators under normal circumstances. Then it classifies evident failures by their impact. This determines the type and urgency of maintenance required.
Safety/Environmental: Failure could injure or kill someone, or could breach environmental regulations or standards.
Task must eliminate the risk or reduce it to a tolerable level.
Operational: Failure affects production output, product quality, customer service, or operating costs in addition to the direct cost of repair.
Task must reduce the total cost of failure.
Non-Operational: Failure involves only the direct cost of repair - no safety, environmental, or production impact.
Task must cost less than the repair it avoids.
Evident: Someone will become aware of the failure under normal circumstances - through alarms, obvious malfunction, quality defects, etc.
Hidden: The failure will not be noticed unless a specific check is made, OR until a demand is placed on the device (e.g., safety device needed during emergency).
Hidden failures expose the organization to multiple failures - if a protective device has already failed silently, and the protected equipment then fails, the consequences can be catastrophic.
Key Principle: Hidden failures alone have no direct impact. Their consequences only become apparent when combined with another failure. This is why failure-finding tasks are essential for protective devices.
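One common way to put numbers on this principle is to treat the rate of multiple failures as the failure rate of the protected function multiplied by the average unavailability of the protective device. The sketch below assumes that relationship and uses hypothetical figures; the function name and inputs are not from the source.

```python
def multiple_failure_rate(protected_failures_per_year: float,
                          protective_unavailability: float) -> float:
    """Expected multiple failures per year.

    Assumes the protected function and the protective device fail
    independently, so the multiple failure occurs only when the protected
    function fails while the protective device is already in a failed state.
    """
    if protected_failures_per_year < 0:
        raise ValueError("failure rate must be non-negative")
    if not 0.0 <= protective_unavailability <= 1.0:
        raise ValueError("unavailability must be between 0 and 1")
    return protected_failures_per_year * protective_unavailability


# Hypothetical numbers: the protected function fails about once in 4 years,
# and the protective device is in a failed state 3% of the time.
rate = multiple_failure_rate(0.25, 0.03)
print(f"{rate:.4f} multiple failures per year")
```

Lowering the unavailability of the protective device, which is what a failure-finding task does, lowers the multiple failure rate in direct proportion.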
Is the failure evident to operators during normal operation?
NO → HIDDEN
YES → EVIDENT
For EVIDENT failures: Does it have safety or environmental consequences?
YES → S/E
NO → Check operational
Does it affect operations (production, quality, cost)?
YES → OPERATIONAL
NO → NON-OPERATIONAL
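The three questions above map directly onto a small function. This is a minimal sketch; the enum and parameter names are chosen here for illustration.

```python
from enum import Enum


class Consequence(Enum):
    HIDDEN = "hidden"
    SAFETY_ENVIRONMENTAL = "safety/environmental"
    OPERATIONAL = "operational"
    NON_OPERATIONAL = "non-operational"


def classify(evident: bool,
             safety_or_environmental: bool,
             affects_operations: bool) -> Consequence:
    """Walk the consequence questions in the order given above."""
    if not evident:
        # Not evident to operators in normal operation: hidden failure.
        return Consequence.HIDDEN
    if safety_or_environmental:
        return Consequence.SAFETY_ENVIRONMENTAL
    if affects_operations:
        return Consequence.OPERATIONAL
    return Consequence.NON_OPERATIONAL


print(classify(evident=True, safety_or_environmental=False, affects_operations=True))
```

The order of the checks matters: an evident failure that both endangers people and stops production is classified as safety/environmental, because that question is asked first.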
Tasks performed before failure to prevent or predict failure. Must be technically feasible AND worth doing.
On-Condition: Detect potential failure with enough warning to act. Requires a detectable P-F interval (the time between a detectable potential failure, P, and the functional failure, F). Examples: vibration analysis, thermography, oil analysis, ultrasound.
Scheduled Restoration: Restore to original capability at fixed intervals regardless of condition. Requires an age-reliability relationship and an identifiable "life" point.
Scheduled Discard: Replace at or before a specified life limit regardless of condition. For items where restoration is impractical or where failure is catastrophic.
RCM2 prefers tasks in this order: On-Condition → Scheduled Restoration → Scheduled Discard. On-condition is preferred because it bases action on actual condition, not assumed age.
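The preference order can be sketched as a simple search over candidate tasks. The `Candidate` record and its two boolean flags are hypothetical stand-ins for the judgments the review team makes; they are not part of any RCM2 tool.

```python
from dataclasses import dataclass
from typing import List, Optional

# Preference order stated above, most preferred first.
PREFERENCE = ["on-condition", "scheduled restoration", "scheduled discard"]


@dataclass
class Candidate:
    task_type: str              # one of the entries in PREFERENCE
    technically_feasible: bool  # e.g. a usable P-F interval exists
    worth_doing: bool           # judged against the failure consequences


def select_proactive_task(candidates: List[Candidate]) -> Optional[Candidate]:
    """Return the most preferred task that is feasible AND worth doing."""
    for task_type in PREFERENCE:
        for candidate in candidates:
            if (candidate.task_type == task_type
                    and candidate.technically_feasible
                    and candidate.worth_doing):
                return candidate
    return None  # no proactive task qualifies; a default action is needed


chosen = select_proactive_task([
    Candidate("scheduled discard", True, True),
    Candidate("on-condition", True, False),          # feasible but not worth doing
    Candidate("scheduled restoration", True, True),
])
print(chosen.task_type if chosen else "no proactive task")
```

In this example the on-condition task is skipped because it is not worth doing, so scheduled restoration is selected ahead of scheduled discard.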
Used when NO proactive task is technically feasible or worth doing.
Failure-Finding: Scheduled checks to discover hidden failures before a demand. Used for protective devices. Interval based on acceptable unavailability.
Redesign: Modify hardware, operating procedures, or training. Required when consequences are intolerable and no suitable task exists.
Run-to-Failure: Allow failure to occur, then repair. ONLY acceptable if failure has NO safety or environmental consequences.
Run-to-failure is NEVER acceptable for failures with safety or environmental consequences! If no proactive task works, redesign is mandatory.
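The failure-finding entry above ties the check interval to acceptable unavailability. A widely quoted approximation for a single protective device is FFI = 2 × (acceptable unavailability) × (mean time between failures of the device). The formula is not stated in this document and holds only for small unavailabilities (a few percent), so the sketch below should be read under that assumption; the figures are hypothetical.

```python
def failure_finding_interval(acceptable_unavailability: float,
                             device_mtbf: float) -> float:
    """Approximate failure-finding interval, in the same units as device_mtbf.

    Uses the linear approximation FFI = 2 * U * MTBF, which assumes a single
    protective device, random failures, and a small unavailability U.
    """
    if not 0.0 < acceptable_unavailability < 1.0:
        raise ValueError("unavailability must be between 0 and 1 (exclusive)")
    if device_mtbf <= 0:
        raise ValueError("MTBF must be positive")
    return 2.0 * acceptable_unavailability * device_mtbf


# Hypothetical device: MTBF of 10 years, 1% unavailability tolerated.
interval_years = failure_finding_interval(0.01, 10.0)
print(f"check every {interval_years * 12:.1f} months")
```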
| Consequence | Proactive Task Criteria | If No Proactive Task |
|---|---|---|
| Hidden | Task must reduce risk of multiple failure to tolerable level | Failure-finding task is mandatory. If not possible, redesign |
| Safety/Environmental | Task must reduce risk of failure to tolerable level (or eliminate) | Redesign is mandatory. Run-to-failure NOT acceptable |
| Operational | Total cost of task must be less than total cost of failure over time | Run-to-failure may be acceptable if economically justified |
| Non-Operational | Cost of task over time must be less than cost of repair | Run-to-failure is acceptable (direct repair cost only) |
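The table's last column and its economic criteria can be sketched as two small functions. The string labels, function names, and cost figures are illustrative assumptions; in practice the costs would come from the plant's own records.

```python
def default_action(consequence: str, failure_finding_feasible: bool = False) -> str:
    """Default action when no proactive task qualifies, per the table above."""
    if consequence == "hidden":
        return "failure-finding" if failure_finding_feasible else "redesign"
    if consequence == "safety/environmental":
        return "redesign"  # run-to-failure is not acceptable here
    if consequence == "operational":
        return "run-to-failure if economically justified"
    if consequence == "non-operational":
        return "run-to-failure"
    raise ValueError(f"unknown consequence category: {consequence!r}")


def worth_doing_economically(task_cost_per_year: float,
                             failure_cost_without_task: float,
                             failure_cost_with_task: float = 0.0) -> bool:
    """Economic test for operational and non-operational consequences.

    The task is worth doing when its cost plus the residual cost of failure
    is lower than the cost of failure with no task (all per year).
    """
    return task_cost_per_year + failure_cost_with_task < failure_cost_without_task


print(default_action("hidden", failure_finding_feasible=True))
print(worth_doing_economically(2_000, 15_000, failure_cost_with_task=3_000))
```

For non-operational consequences the failure cost is just the direct repair cost, while for operational consequences it also includes lost production and similar effects.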
RCM2 analyses are conducted by cross-functional teams, not individuals. Typical team of 6-8 members:
Facilitator: Guides the process, ensures rigor, trained in RCM methodology. Does NOT need to know the equipment.
Operations: Knows what the equipment must do, operating context, production requirements.
Maintenance: Knows how equipment fails, repair history, maintenance practices.
Engineering: Technical expertise, design intent, modifications, OEM knowledge.
Safety/Environmental: Regulatory requirements, hazard identification, environmental compliance.
The team must include people who operate and maintain the equipment daily. They have knowledge no document can capture. RCM2 captures institutional knowledge before it's lost.
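Information Worksheet: records answers to Questions 1-4 (functions, functional failures, failure modes, failure effects).
Decision Worksheet: records answers to Questions 5-7 (failure consequences, proactive tasks, default actions).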
| Phase | Step | Description | Key Outputs |
|---|---|---|---|
| Preparation | 1 | Select system/equipment for analysis (criticality ranking) | Prioritized asset list |
| | 2 | Define system boundaries and operating context | Context statement, boundaries |
| | 3 | Assemble team, gather documentation | Team roster, P&IDs, manuals |
| Analysis | 4 | Identify functions and performance standards (Q1) | Function list |
| | 5 | Identify functional failures and failure modes (Q2-Q3) | FMEA worksheet |
| | 6 | Document failure effects (Q4) | Completed Information Worksheet |
| | 7 | Apply decision logic for each failure mode (Q5-Q7) | Decision Worksheet, task list |
| Implementation | 8 | Review and approve maintenance tasks | Approved task list |
| | 9 | Load tasks into CMMS, assign resources | PM work orders |
| | 10 | Train personnel, begin execution | Trained workforce |
| Living RCM | 11 | Monitor results, track failures | Performance metrics |
| | 12 | Update analysis based on new data | Revised worksheets |
| Standard/Resource | Description |
|---|---|
| SAE JA1011 | Evaluation Criteria for RCM Processes |
| SAE JA1012 | Guide to the RCM Standard |
| IEC 60300-3-11 | RCM Application Guide |
| MIL-HDBK-2173 | US Military RCM Requirements |
| NAVAIR 00-25-403 | Naval Aviation RCM Guidelines |
| RCM2 Book | John Moubray - "Reliability-Centered Maintenance" (2nd Ed.) |