Introduction to Reliability

All products will eventually fail. Reliability is the concept that quantifies how dependable a product is over its life cycle, and in its simplest form it is defined as the probability of success. The objective of studying reliability is to quantify how “reliable” a product or service is, understand the causes of poor reliability, and deploy actions to improve it. Unreliable products increase the cost of ownership. In this article, we will focus on some basic concepts of reliability.

What is Reliability?

In a strict sense, reliability is defined as the probability that a device will perform its intended function during a specified period of time under stated conditions. The constraint of “stated conditions” is important as it is impossible to estimate the failure probability for unlimited conditions. Reliability usually changes as a function of time and is denoted as R(t). Examples of reliability statements are:
• The basic coverage warranty lasts for 36 months or 36,000 miles.
• We warrant the bulb will be free from defects and will operate for 3 years based on 3 hours/day.
The formula for calculating reliability is:
R(t) = (number of products still working at time t) / (number of products at time t = 0)

What is Failure Probability?

The failure probability, F(t), is the probability of failure. The two concepts are closely related since reliability is one minus the failure probability. Let’s assume that we have purchased 100 products. It is assumed that when you purchase a product at time t=0, the product is working so that R(0) = 1 and F(0) = 0. Eventually, the products will start to fail with time. However, different products will fail at different time points due to inherent variability in the products, the components used and the application of the products in the real world. Hence, reliability and failure probability are reported in terms of the probability of success or failure. The formula for calculating failure probability is:
F(t) = 1 - R(t) = (number of products that have failed by time t) / (number of products at time t = 0)

What is Failure Rate?

A related concept is the failure rate, usually denoted by lambda(t). The failure rate quantifies how quickly products fail relative to the number of products currently running. We need to consider whether failed products are repaired and put back into service or cannot be repaired on failure. For repairable systems, the number of working systems at the start of the time period will be the same for every period. For non-repairable systems, the number of working systems keeps reducing as failures occur. The formula for calculating failure rate is:
lambda(t) = (number of failures during the period) / (number of products working at the start of the period × length of the period)

Example Calculation

What is the probability that a product will fail in a given time? If out of 100 products, 8 fail at the end of 10 hours, then the failure probability is 8/100 or 8% and the reliability is 92/100 or 92%. The following table shows the reliability and failure probability over time for repairable systems. Over time, the failure probability increases from 0 to 1 and the reliability will go from 1 to 0.
| Time | Failures | Reliability | Failure Probability | Failure Rate (per hour) |
|---|---|---|---|---|
| 0 – 10 hours | 8 | 0.92 | 0.08 | 0.008 |
| 10 – 20 hours | 6 | 0.86 | 0.14 | 0.006 |
| 20 – 30 hours | 4 | 0.82 | 0.18 | 0.004 |
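
As a rough check on the numbers in this example, the short sketch below recomputes reliability, failure probability and failure rate from the raw failure counts. It assumes the repairable case described earlier, so 100 units are taken to be running at the start of every 10-hour interval; the function and variable names are ours and purely illustrative.

```python
# Recompute the example table from the raw failure counts.
# Assumption: repairable systems, so 100 units are running at the start of
# every interval and the failure-rate denominator never shrinks.

TOTAL_UNITS = 100
INTERVAL_HOURS = 10
FAILURES_PER_INTERVAL = [8, 6, 4]  # failures in 0-10 h, 10-20 h, 20-30 h


def reliability_table(failures, total_units, interval_hours):
    """Return one dict per interval with R(t), F(t) and the failure rate."""
    rows = []
    cumulative_failures = 0
    for index, failed in enumerate(failures):
        cumulative_failures += failed
        rows.append({
            "end_hours": (index + 1) * interval_hours,
            "failures": failed,
            # Reliability and failure probability use cumulative failures.
            "reliability": (total_units - cumulative_failures) / total_units,
            "failure_probability": cumulative_failures / total_units,
            # Failure rate uses only the failures inside this interval.
            "failure_rate": failed / (total_units * interval_hours),
        })
    return rows


table = reliability_table(FAILURES_PER_INTERVAL, TOTAL_UNITS, INTERVAL_HOURS)
for row in table:
    print(
        f"up to {row['end_hours']:>2} h: "
        f"R = {row['reliability']:.2f}, "
        f"F = {row['failure_probability']:.2f}, "
        f"failure rate = {row['failure_rate']:.3f} per hour"
    )
```

For non-repairable systems, the failure-rate denominator would instead be the number of units still working at the start of each interval, so the rates would come out slightly higher in later intervals.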

The following figure shows an example plot of reliability and failure probability over time.

Comparison of Reliability and Quality
Quality describes how well a product or service performs its intended function. For example, a product has good quality if it is safe, efficient, and easy to operate. Reliability, on the other hand, describes how well the product or service maintains its original level of performance over time and across various operating conditions. For example, a product is expected to last 100 hours under normal operating conditions.

Comparison of Reliability and Availability
Availability refers to the percentage of time the product remains operational under normal circumstances in order to serve its intended purpose. For example, a system that is available 90% of the time could have a downtime of 72 hours in a 30-day (720-hour) month. Reliability refers to the probability that the system will meet certain performance standards for a desired time duration. This can be thought of as the uptime. A common metric for reliability is the Mean Time Between Failures (MTBF). Once the system is down, it takes some time to restore normal operation. This period of time is called the downtime and is typically measured by the Mean Time to Repair (MTTR). The combination of uptime and downtime determines the availability of the product or service.
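
To make the link between uptime, downtime and availability concrete, here is a minimal sketch. It assumes the commonly used steady-state relation availability = MTBF / (MTBF + MTTR), which this article does not state explicitly, and the MTBF and MTTR values are made up for illustration.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability, assuming A = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_hours(availability_fraction, period_hours):
    """Expected downtime over a period for a given availability."""
    return (1 - availability_fraction) * period_hours


# Hypothetical system: fails on average every 90 hours, takes 10 hours to repair.
a = availability(mtbf_hours=90, mttr_hours=10)
monthly_downtime = downtime_hours(a, period_hours=30 * 24)

print(f"availability = {a:.0%}")
print(f"downtime in a 30-day month = {monthly_downtime:.0f} hours")
```

Under these assumed numbers, the result matches the 90% and 72-hour figures used in the example above.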

Modeling Reliability

One of the most commonly used distributions for analyzing reliability is the Weibull distribution, because it is versatile enough to handle many different situations. A Weibull distribution has two parameters, alpha and beta, and its reliability function is:
R(t) = exp(-(t/alpha)^beta)

Alpha is called the scale factor and represents the characteristic life of the product, which is the time at which 63% of the products have failed. Note from the equation above that if t = alpha, then the reliability value is R(t) = exp(-1) = 0.37. Hence, the failure probability is 0.63 independent of the value of beta.

Beta is called the shape factor and determines the shape of the Weibull distribution. A value of beta less than 1 represents early failures (typically observed at the beginning of the product life, when the failure rate starts high and falls as weak units are removed and issues are fixed). A value of beta equal to 1 represents a constant failure rate (the exponential distribution, typical of the useful life of the product). A value of beta greater than 1 represents wear-out failures (typically observed at the end of the product life, when components start wearing out and the failure rate increases until all products fail).
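
The sketch below illustrates both points numerically, assuming the standard two-parameter Weibull form R(t) = exp(-(t/alpha)^beta) and the corresponding failure rate lambda(t) = (beta/alpha) (t/alpha)^(beta-1). The characteristic life of 1000 hours and the three beta values are hypothetical, chosen only to show one case of each type.

```python
import math


def weibull_reliability(t, alpha, beta):
    """R(t) = exp(-(t / alpha) ** beta) for t >= 0."""
    return math.exp(-((t / alpha) ** beta))


def weibull_failure_probability(t, alpha, beta):
    """F(t) = 1 - R(t)."""
    return 1.0 - weibull_reliability(t, alpha, beta)


def weibull_failure_rate(t, alpha, beta):
    """lambda(t) = (beta / alpha) * (t / alpha) ** (beta - 1), for t > 0."""
    return (beta / alpha) * (t / alpha) ** (beta - 1)


ALPHA = 1000.0  # hypothetical characteristic life in hours

for beta in (0.5, 1.0, 3.0):
    r_at_alpha = weibull_reliability(ALPHA, ALPHA, beta)
    # Compare the failure rate early and late in life to see its trend.
    early = weibull_failure_rate(100.0, ALPHA, beta)
    late = weibull_failure_rate(900.0, ALPHA, beta)
    trend = "falling" if late < early else "rising" if late > early else "constant"
    print(f"beta = {beta}: R(alpha) = {r_at_alpha:.3f}, failure rate is {trend}")
```

Whatever beta is used, the reliability at t = alpha comes out the same, while the failure-rate trend changes with beta. Note that the failure-rate function is only evaluated for t > 0 here, since for beta less than 1 it is unbounded at t = 0.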

If data on past failures is available, distribution fitting can be used to estimate the Weibull parameters (alpha and beta). Once the reliability curve is obtained, we can use it to quantify important reliability properties of the products and answer questions such as “how long will 90% of the products be expected to last?”. We can also use the estimated beta parameter to determine the type of failures we are encountering (such as early life, constant rate or wear-out).
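
As a sketch of how such a question could be answered once alpha and beta are known, the Weibull reliability function R(t) = exp(-(t/alpha)^beta) can be solved for t, giving t = alpha (-ln R)^(1/beta). The parameter values below are hypothetical, not estimates from any real data set.

```python
import math


def weibull_reliability(t, alpha, beta):
    """R(t) = exp(-(t / alpha) ** beta) for t >= 0."""
    return math.exp(-((t / alpha) ** beta))


def time_to_reliability(target_reliability, alpha, beta):
    """Solve R(t) = target for t: t = alpha * (-ln(target)) ** (1 / beta)."""
    if not 0.0 < target_reliability < 1.0:
        raise ValueError("target_reliability must be strictly between 0 and 1")
    return alpha * (-math.log(target_reliability)) ** (1.0 / beta)


# Hypothetical fitted parameters: characteristic life 1000 hours, wear-out shape.
ALPHA = 1000.0
BETA = 2.0

t90 = time_to_reliability(0.90, ALPHA, BETA)
print(f"90% of the products are expected to last at least {t90:.1f} hours")
print(f"check: R(t90) = {weibull_reliability(t90, ALPHA, BETA):.2f}")
```

The second print simply plugs the answer back into R(t) as a sanity check.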

Reliability in the Real World

There are several complicating factors when we try to define reliability in the real world. First, not all products are sold and installed at the same time, so units in the field have different ages at any given moment. Secondly, the status of a product keeps changing due to repairs, upgrades and modifications, making it hard to compare data across products. Finally, not all products are operated all the time. For example, one person may use a TV for 1 hour per day and another may use it for 4 hours per day. Hence, it is difficult to use calendar time to estimate reliability, since calendar time is often not the operational time.

Estimation of parameters in the real world is also complex and is not a simple regression fit of the data. Any study we perform to estimate the failure data will result in certain units not failing within the test times (also called censored data). For example, if we start testing with 10 products, the first product fails in 110 hours, the second one fails in 124 hours, and the third one fails in 200 hours. The remaining products are still operating at 500 hours where we had to stop the study due to time and/or cost considerations. We need to account for these seven non-failed products when we are estimating the Weibull parameters. We will cover estimation of Weibull parameters in the presence of censored data in a later article.
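
To see why the non-failed units matter, here is a deliberately simplified sketch using the test data above. It assumes the shape parameter beta is already known (we use beta = 1, a constant failure rate, purely for illustration), in which case the maximum likelihood estimate of alpha with right-censored units has the closed form alpha = (sum of t^beta over all units / number of failures)^(1/beta). This is not the full two-parameter estimation, which requires a numerical fit.

```python
def weibull_scale_mle(failure_times, censored_times, beta):
    """Estimate alpha for a KNOWN shape beta with right-censored units.

    alpha_hat = (sum of t ** beta over ALL units / number of failures) ** (1 / beta)
    Censored units add running time to the numerator but no failure to the
    denominator.
    """
    if not failure_times:
        raise ValueError("at least one observed failure is required")
    total = sum(t ** beta for t in failure_times)
    total += sum(c ** beta for c in censored_times)
    return (total / len(failure_times)) ** (1.0 / beta)


FAILURE_HOURS = [110.0, 124.0, 200.0]  # the three observed failures
CENSORED_HOURS = [500.0] * 7           # seven units still running at 500 hours
ASSUMED_BETA = 1.0                     # assumption: constant failure rate

with_censoring = weibull_scale_mle(FAILURE_HOURS, CENSORED_HOURS, ASSUMED_BETA)
ignoring_censoring = weibull_scale_mle(FAILURE_HOURS, [], ASSUMED_BETA)

print(f"alpha using all 10 units:       {with_censoring:.0f} hours")
print(f"alpha using only the 3 failures: {ignoring_censoring:.0f} hours")
```

Dropping the seven surviving units makes the product look far less reliable than the test suggests, which is why censored data has to be included in the estimation.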

Summary

We discussed the concepts of reliability, failure probability, and failure rate. We also made a distinction between quality and reliability, and between reliability and availability. Finally, we discussed how to model reliability using the Weibull distribution. In the next article in this series, we will discuss how to estimate the Weibull model parameters from real-world failure data.