Last week I played a bit with some reliability models and estimators to explore how a drone-based company could run survival analysis over its fleet in order to make decisions at the hardware level while keeping its operational risks and costs under control.

This entry contains some references, pointers and experiments on applying survival analysis to simulated UAV samples with R.

**Survival analysis, reliability and lifetime**

Let's start with the definition of survival analysis. Wikipedia describes survival analysis as "a branch of statistics for analyzing the expected duration of time until one or more events happen, such as death in biological organisms and failure in mechanical systems. This topic is called reliability theory or reliability analysis in engineering, duration analysis or duration modelling in economics, and event history analysis in sociology."

This definition is interesting because it introduces the concept of 'expected duration of time until one or more events happen' and also links it to other areas and disciplines such as engineering and economics.

The expected duration of time until one or more events happen is also known as the 'lifetime'. In biostatistics the concept of lifetime fits naturally, but in engineering you need to think of the lifetime of some component, system or product until the event happens.

Another way of looking at lifetime is as the time elapsed from the 'birth' of the system until the appearance of a certain event, which we call 'failure'.

The lifetime is usually measured in time units but it can be measured in other units if needed. For example, you could be interested in the expected number of hours before an engine fails, while someone else could be interested in the number of revolutions of that same engine before the failure happens.

In the case of UAVs, metrics such as the 'number of flight hours' or the 'number of flights' could make sense. I say 'could' because it all depends on how you define 'UAV survival'. Here we assume a UAV survives when it comes back from a mission without major damage and can be reused for the next mission after minor maintenance.

On the other hand, a company could consider one-way missions where vehicles cross hostile weather conditions and 'surviving' would mean reaching the destination.

**The survival function**

Once the lifetime is defined, it is possible to define the reliability of a system or component. This definition is quantitative and is expressed as a probability.

The survival function, also known as the reliability function, is defined as the probability that the lifetime T will exceed some value t on a defined time scale: $S(t) = P(T > t) = 1 - F(t)$.

Here F(t) is the cumulative distribution function (c.d.f.) of the random variable T, so the survival function S(t) gives the probability that a subject will survive past time t.

In theory, the survival function is smooth but in practice, we observe events on a discrete time scale.

**The failure rate or hazard rate**

The function $h(t) = f(t) / (1 - F(t))$, where f(t) is the lifetime density function, defined for all t > 0 such that F(t) < 1, is called the failure or hazard rate.

If we know the system failure rate h(t), we can uniquely reconstruct its lifetime density function and its lifetime c.d.f.
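One standard way to write that reconstruction is shown below. H(t) is introduced here only as shorthand for the cumulative hazard; it is not a symbol used elsewhere in this entry.

$$
H(t) = \int_0^t h(u)\,du, \qquad S(t) = 1 - F(t) = e^{-H(t)}, \qquad f(t) = h(t)\,S(t)
$$

For a constant rate $h(t) = \lambda$ this gives $S(t) = e^{-\lambda t}$, which is the exponential lifetime.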

If h(t) is plotted against time for a general system the well-known "bathtub curve" appears. This curve can be divided into three regions (I, II and III).

Region I is termed the region of "infant mortality" or "early-life failures", where an underlying distribution is sometimes difficult to determine. Manufacturers will frequently subject their products to a burn-in period, attempting to eliminate the early failures before lots are shipped to the consumer.

Region II corresponds to "a constant failure rate function" or "useful-life failures", and is the region of chance failures to which the exponential distribution applies.

Region III corresponds to a "wear-out process" or "wear-out failures", for which the normal distribution or the more flexible Weibull distribution often provides an adequate model.

If a system or component failure distribution is described using the exponential distribution, then this implies its failure rate function is constant, and we can work on the steady state portion of the bathtub curve with no burn-in or wear-out happening.
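As a rough illustration of the three regions, the sketch below evaluates the hazard rate of a Weibull lifetime for three shape values. The shapes (0.5, 1 and 3), the unit scale and the helper name `weibull_hazard` are illustrative assumptions, not values taken from any real fleet.

```r
# Hazard rate of a Weibull lifetime: h(t) = f(t) / S(t)
weibull_hazard <- function(t, shape, scale = 1) {
  dweibull(t, shape, scale = scale) /
    pweibull(t, shape, scale = scale, lower.tail = FALSE)
}

t <- c(0.5, 1, 2, 4)  # arbitrary evaluation times

hazards <- rbind(
  infant_mortality = weibull_hazard(t, shape = 0.5),  # shape < 1: decreasing rate
  useful_life      = weibull_hazard(t, shape = 1),    # shape = 1: constant rate (exponential)
  wear_out         = weibull_hazard(t, shape = 3)     # shape > 1: increasing rate
)
colnames(hazards) <- paste0("t=", t)
print(round(hazards, 3))
```

A real bathtub curve mixes the three behaviours over the life of one system; a single Weibull only reproduces one region at a time.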

**Reliability databases**

The failure rate is a critical parameter in the survival and reliability analysis of a system or component. It may be quite difficult and expensive to obtain this and other reliability parameters through experimentation and testing in physical environments. In some cases it is not even possible due to technical requirements and lack of tooling.

The failure rate and other key parameters in the reliability industry are usually provided through reliability databases.

Some popular, although proprietary, databases are the Process Equipment Reliability Database (PERD), the Electronic Parts Reliability Data (EPRD), the Nonelectronic Parts Reliability Data (NPRD) and the Offshore Reliability Data (OREDA).

On the open side, there are also well-known databases on-line such as the Handbook of Mechanical Reliability and the MIL-HDBK-217F, Reliability Prediction of Electronic Equipment.

**Emergency repair vs Preventive maintenance**

Take as an example the consequences of a critical failure that prevents the vehicle from flying a mission, and the costs of recovering it after a severe crash.

These costs should be assigned to the emergency repair budget. The vehicle cannot be made available for the next mission without fixing or replacing the broken parts and checking that the whole system is fit to fly again.

Another category of costs falling under the emergency repair budget covers the actions required to search for and physically recover the vehicle. Depending on the kind of terrain where the drone is operating (sea, mountains, jungle...) you may not be able to locate the vehicle and, even if you decide to skip the search and recovery operation, you will have to account for the cost of the vehicle, the cost of the mission payload and the opportunity cost on the next missions (business cost).

As expected, emergency repair costs are high, unpredictable and difficult to fit into a sustainable business working at scale.

So, as far as possible, a better approach would be to exchange these emergency repair costs for preventive maintenance costs. Both emergency repair and preventive maintenance completely renew the system, but the second one works under the so-called 'age replacement scheme'.

Here we suppose the system has a known lifetime distribution and we decide to perform preventive maintenance when the system reaches age T. The optimal T minimizes the average cost per unit time.
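A minimal sketch of that optimization is shown below, using the textbook age-replacement cost rate: expected cost of one renewal cycle divided by expected cycle length. The Weibull parameters and the two cost figures are hypothetical, chosen only so that the lifetime shows wear-out and a failure costs ten times more than a planned replacement.

```r
# Hypothetical inputs (not taken from the post): a wear-out Weibull lifetime
# and the cost of a planned replacement vs. an emergency repair.
shape <- 2.5
scale <- 100                 # hours
cost_preventive <- 1
cost_failure    <- 10

# Survival function S(t) of the assumed lifetime distribution
surv <- function(t) pweibull(t, shape, scale = scale, lower.tail = FALSE)

# Average cost per unit time when the system is renewed at age `age`
# or at failure, whichever comes first.
cost_rate <- function(age) {
  expected_cycle_cost   <- cost_preventive * surv(age) + cost_failure * (1 - surv(age))
  expected_cycle_length <- integrate(surv, lower = 0, upper = age)$value
  expected_cycle_cost / expected_cycle_length
}

opt <- optimize(cost_rate, interval = c(1, 300))
opt$minimum     # replacement age that minimizes the cost rate
opt$objective   # cost per hour at that age

# Cost per hour when running every unit until it fails, for comparison
cost_failure / (scale * gamma(1 + 1 / shape))
```

With a constant or decreasing failure rate (shape at or below 1) there is no finite optimum, because replacing a unit that is not wearing out brings no benefit.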

**Maintenance rate and its optimization**

If every vehicle is running one mission after another without interruption for business reasons, we should be interested in the vehicle's maintenance rate and its optimization. This optimization can noticeably reduce the number of vehicles in the fleet or the component stock held by the company.

This paper is an effort in this direction. It covers how this optimization could work according to the model of a single-server queueing system. It also brings in the costs per time unit of restoration and maintenance (monetary units/hour).

The authors of the paper report that the proposed strategy increases the average specific income per time unit by at least 15% and reduces the average specific cost per time unit of correct device operation by more than 1.5 times.

**Failure analysis methods: FMEA and FTA**

Failure Mode and Effects Analysis (FMEA) and Fault Tree Analysis (FTA) are two common techniques used in failure analysis. A nice article covering this topic for UAVs is available here.

Although the goal of the paper is to study the impact of low-cost UAV redesigns and their advantages, it describes a possible methodology using FMEA and FTA as part of the effort.

The paper shows how decisions could be made based on the failure rate assigned to each failure category. We would be able to compare the overall probability of a 'category X' failure, calculated as Y failures per Z flight hours, against another category and/or redesign.

The Hazard Analysis Matrix (HAM) is also used in the article to visualize and quickly determine which components would be most problematic and would most likely require consideration.

This NASA handbook on Fault Tree with Aerospace Applications is a great resource too.

**Failure metrics: MTTR, MTTF and MTBF**

The Mean Time Between Failures (MTBF) refers to the amount of time that elapses between one failure and the next. Mathematically, this is the sum of the Mean Time to Failure (MTTF) and the Mean Time to Repair (MTTR), the total time required for a device to fail and that failure to be repaired.

In our case, one faulty unit (vehicle, component, etc) with an MTTF of 85 hours and an MTTR of 15 hours would have an MTBF of 100 hours.

**Estimating the survival function**

To estimate the survival function we can follow two approaches. We can use non-parametric estimators or we can estimate the survival distribution by making parametric assumptions.

In the non-parametric approach we are interested in the Kaplan-Meier product-limit estimator. This estimator works with censored data and supports right-censoring, which happens if we have to withdraw a UAV from the study because it was stolen, lost, etc., or if no event/failure had happened at the last follow-up.
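To make the product-limit idea concrete, here is a small base R sketch that computes the estimate by hand on six observations. The values match the first six rows of the simulated dataset used later in this entry; used alone they are only a toy sample.

```r
# Six observations: lifetime in hours and event indicator
time   <- c(9, 13, 13, 15, 24, 26)
status <- c(1, 1, 0, 1, 1, 1)   # 1 = failure, 0 = right-censored

# Distinct times at which at least one failure was observed
event_times <- sort(unique(time[status == 1]))

# Units still under observation just before each event time
n_risk  <- sapply(event_times, function(t) sum(time >= t))
# Failures observed exactly at each event time
n_event <- sapply(event_times, function(t) sum(time == t & status == 1))

# Product-limit estimate: multiply the conditional survival fractions
km <- data.frame(
  time     = event_times,
  n.risk   = n_risk,
  n.event  = n_event,
  survival = cumprod(1 - n_event / n_risk)
)
print(km)
```

The unit censored at 13 hours still counts as at risk at time 13 but drops out of the risk set afterwards without counting as a failure, which is how right-censoring enters the estimate.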

It is worth mentioning the log-rank test at this point, also known as the Mantel-Cox test. It is a non-parametric hypothesis test that is appropriate when the data are right-censored, and it is used to compare the survival distributions of two samples.

On the other hand, the Kaplan-Meier estimator is limited in its ability to estimate survival adjusted for covariates; parametric survival models and the Cox proportional hazards model may be useful to overcome this limitation.

In the parametric approach we make more assumptions, which allow us to model the data in more detail.

The parametric approach will estimate S(t) more precisely than the Kaplan-Meier estimator, provided the assumed distribution and its parameters are right. It is useful to compute selected quantiles of the distribution, the expected failure time, etc.

Three of the most popular distributions used to estimate survival curves are the Weibull, Exponential and Log-normal distributions.

We will also be interested in the Maximum Likelihood Estimation (MLE) and Median Rank Regression (MRR) methods to estimate the unknown parameters of some of these distributions.
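The following base R sketch contrasts the two methods on a simulated, uncensored Weibull sample. The sample size, the true parameters (shape 1.5, scale 30) and the use of Benard's approximation for the median ranks are assumptions made for illustration; dedicated packages may implement MRR differently, for instance by regressing in the other direction.

```r
set.seed(42)
x <- sort(rweibull(200, shape = 1.5, scale = 30))  # hypothetical complete sample
n <- length(x)

# --- Median Rank Regression -------------------------------------------------
# Benard's approximation of the median rank of the i-th ordered failure
median_rank <- (seq_len(n) - 0.3) / (n + 0.4)
# Weibull linearisation: log(-log(1 - F)) = shape * log(t) - shape * log(scale)
y <- log(-log(1 - median_rank))
line <- lm(y ~ log(x))
shape_mrr <- unname(coef(line)[2])
scale_mrr <- unname(exp(-coef(line)[1] / shape_mrr))

# --- Maximum Likelihood Estimation ------------------------------------------
# Negative log-likelihood, parameterised on the log scale to keep both positive
nll <- function(p) {
  -sum(dweibull(x, shape = exp(p[1]), scale = exp(p[2]), log = TRUE))
}
mle <- exp(optim(c(0, log(mean(x))), nll)$par)

round(c(shape_mrr = shape_mrr, scale_mrr = scale_mrr,
        shape_mle = mle[1], scale_mle = mle[2]), 2)
```

Both methods should land near the true parameters on a sample of this size; differences between them tend to matter more with small or heavily censored samples.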

**The survival analysis support in R**

The CRAN Task View: Survival Analysis resource works as an index on well-supported and available R packages related to survival analysis and reliability. It presents the useful R packages for the analysis of time to event data.

Beyond the standard survival analysis section, it lists and describes packages covering multistate models, relative survival, random effect models, multivariate survival, Bayesian models, prediction performance, power analysis, simulation, etc.

**The Kaplan-Meier product-limit estimator in R**

To illustrate the Kaplan-Meier estimator in R we will use this simulated dataset. It contains four columns: 'subject', 'time', 'status' and 'rx'.

'Subject' is a unique id in our simulated UAV fleet.

'Time' is the time of failure or the last time at which the UAV was known to work properly.

'Status' signals with 0 the censored data and with 1 the failure events.

'rx' signals with 0 the UAVs belonging to the batch with 'Engine A' and with 1 those in the batch with 'Engine B'. This column will be useful to compare the survival curves between these two subgroups in the sample.

To load the dataset and have a look at the data:

    > library('survminer')
    > library('survival')
    > xx <- read.csv("ds-km.csv", sep = ";", header = T)
    > head(xx)
      subject time status rx
    1       1    9      1  0
    2       2   13      1  0
    3       3   13      0  0
    4       4   15      1  0
    5       5   24      1  0
    6       6   26      1  0
    > dim(xx)
    [1] 50  4
    > table(xx$rx)
     0  1
    25 25
    > table(xx$status)
     0  1
    14 36

Now we can run the survfit() function to calculate the Kaplan-Meier estimate over the whole sample and print a summary:

    > fit <- survfit(Surv(time, status) ~ 1, data = xx)
    > summary(fit)
     time n.risk n.event survival std.err lower 95% CI upper 95% CI
        9     50       1   0.9800  0.0198       0.9420        1.000
       13     49       1   0.9600  0.0277       0.9072        1.000
       15     47       1   0.9396  0.0338       0.8756        1.000
       24     45       1   0.9187  0.0390       0.8454        0.998
       26     44       2   0.8769  0.0471       0.7893        0.974
       27     41       1   0.8555  0.0506       0.7620        0.961
       31     40       1   0.8342  0.0536       0.7354        0.946
       33     38       2   0.7903  0.0591       0.6825        0.915
       34     36       1   0.7683  0.0614       0.6569        0.899
       39     34       1   0.7457  0.0636       0.6309        0.881
       40     33       1   0.7231  0.0656       0.6053        0.864
       42     32       1   0.7005  0.0673       0.5802        0.846
       43     31       1   0.6779  0.0688       0.5556        0.827
       47     28       1   0.6537  0.0705       0.5291        0.808
       51     25       1   0.6276  0.0724       0.5006        0.787
       53     24       1   0.6014  0.0739       0.4726        0.765
       54     22       2   0.5467  0.0767       0.4154        0.720
       55     20       1   0.5194  0.0775       0.3876        0.696
       59     19       1   0.4921  0.0781       0.3605        0.672
       62     18       1   0.4647  0.0784       0.3338        0.647
       64     16       2   0.4066  0.0786       0.2783        0.594
       65     14       1   0.3776  0.0782       0.2516        0.567
       67     13       2   0.3195  0.0762       0.2002        0.510
       69     11       1   0.2905  0.0746       0.1756        0.481
       70     10       1   0.2614  0.0726       0.1517        0.450
       75      8       3   0.1634  0.0637       0.0761        0.351
       80      4       1   0.1225  0.0595       0.0473        0.317
       82      3       1   0.0817  0.0518       0.0236        0.283
       90      1       1   0.0000     NaN           NA           NA

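The estimated survival at 27 hours is 0.855 (95% CI: 0.762 to 0.961). The median survival time is 59 hours, the first time at which the estimate drops to 0.5 or below. The first and third quartiles of the lifetime are 39 and 75 hours, the times at which the estimated survival first drops below 0.75 and 0.25 respectively.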

To plot the survival curve:

    > ggsurvplot(fit, data = xx, xlim = c(0,90), xlab = "Time in hours",
                 break.time.by = 5, censor.shape="|", censor.size = 4)

To compare the survival curves for the two batches 'Engine A' and 'Engine B':

    > fit_rx <- survfit(Surv(time, status) ~ rx, data = xx)

To plot the survival curves with a customized output:

    > ggsurvplot(
        fit_rx,                      # survfit object with calculated statistics
        data = xx,                   # data used to fit survival curves
        risk.table = TRUE,           # show risk table
        pval = TRUE,                 # show p-value of log-rank test
        conf.int = TRUE,             # show confidence intervals for
                                     # point estimates of survival curves
        palette = c("#E7B800", "#2E9FDF"),
        xlim = c(0,90),              # present narrower X axis, but not affect
                                     # survival estimates
        xlab = "Time in hours",      # customize X axis label
        break.time.by = 5,           # break X axis in time intervals by 5
        ggtheme = theme_light(),     # customize plot and risk table with a theme
        risk.table.y.text.col = T,   # colour risk table text annotations
        risk.table.height = 0.25,    # the height of the risk table
        risk.table.y.text = FALSE,   # show bars instead of names in text annotations
                                     # in legend of risk table
        ncensor.plot = TRUE,         # plot the number of censored subjects at time t
        ncensor.plot.height = 0.25,
        conf.int.style = "step",     # customize style of confidence intervals
        surv.median.line = "hv",     # add the median survival pointer
        legend.labs = c("Engine A", "Engine B")  # change legend labels
      )

To calculate the log-rank test:

    > survdiff(formula = Surv(time, status) ~ rx, data = xx)
          N Observed Expected (O-E)^2/E (O-E)^2/V
    rx=0 25       19      9.2     10.43      16.8
    rx=1 25       17     26.8      3.58      16.8

     Chisq= 16.8  on 1 degrees of freedom, p= 4.18e-05

...and the Wilcoxon-type test (with rho = 1, survdiff() computes the Peto and Peto modification of the Gehan-Wilcoxon test):

    > survdiff(formula = Surv(time, status) ~ rx, data = xx, rho=1)
          N Observed Expected (O-E)^2/E (O-E)^2/V
    rx=0 25    14.13     7.08      7.03      14.9
    rx=1 25     7.28    14.34      3.47      14.9

     Chisq= 14.9  on 1 degrees of freedom, p= 0.000113

We get p < 0.001 in both tests, so the difference between the groups is statistically significant.

Because the log-rank test is purely a test of significance it cannot provide an estimate of the size of the difference between the groups or a confidence interval. For these we must make some assumptions about the data. Common methods use the hazard ratio, including the Cox proportional hazards model.
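As a hedged sketch of how a hazard ratio could be obtained, the code below fits a Cox proportional hazards model with coxph() from the survival package. It uses freshly simulated exponential lifetimes, not the dataset above: the group sizes, the mean lifetimes (40 and 80 hours) and the 100-hour censoring limit are invented for illustration.

```r
library(survival)

set.seed(1)
n <- 200                                  # hypothetical units per batch
sim <- data.frame(rx = rep(c(0, 1), each = n))

# Assumed lifetimes: batch 0 fails twice as fast as batch 1
lifetime <- rexp(2 * n, rate = ifelse(sim$rx == 0, 1 / 40, 1 / 80))

# Right-censor everything still flying at 100 hours
sim$time   <- pmin(lifetime, 100)
sim$status <- as.integer(lifetime <= 100)

cox <- coxph(Surv(time, status) ~ rx, data = sim)

hr <- exp(coef(cox))      # hazard ratio of batch 1 relative to batch 0
ci <- exp(confint(cox))   # 95% confidence interval for the hazard ratio
hr
ci
```

A hazard ratio below 1 here means the batch coded as 1 fails at a lower rate than the batch coded as 0, and the interval gives the size of that effect, which the log-rank test alone cannot provide.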

Here you can read more on the Cox proportional-hazards model in R.

**The Weibull model support in R**

In order to explore the Weibull model support in R, we will work with this sample following an unknown Weibull distribution.

We will adopt the OpenReliability.org R packages to fit the model. They are available here. These two R commands will pull, build and install the needed packages.

    > install.packages("RcppArmadillo")
    > install.packages("abrem", repos="http://R-Forge.R-project.org")

To find the unknown parameters for this Weibull distribution we can use different methods as described in this paper. We will use abrem for 2-parameter Weibull plotting with the Maximum Likelihood Estimation (MLE) fitting method.

    > library('abrem')
    > xx.data <- read.csv('ds-wb.csv')$val
    > xx.obj <- Abrem(xx.data)
    > xx.fit <- abrem.fit(xx.obj, method.fit="mle", col="red")
    > plot(xx.fit)

It is also possible to add confidence intervals and B-life determinations to abrem objects. This resource shows how it could be done.

The plot reports the parameters beta = 1.073 (shape) and eta = 25.39 (scale).

Those parameters can be used to calculate the reliability of a subject (vehicle, engine, etc.) at a given time. For example, at time = 2 the reliability is 0.9366611.

    > pweibull(2, 1.073, scale = 25.39, lower.tail = F)
    [1] 0.9366611

If we were interested in how long (in hours) a subject can operate before its reliability drops to 0.75, the calculation would be like this:

    > qweibull(0.75, 1.073, scale = 25.39, lower.tail = F)
    [1] 7.950374
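The same two figures can be cross-checked against the closed-form expressions of the two-parameter Weibull, R(t) = exp(-(t/eta)^beta) and t = eta * (-log(R))^(1/beta). The helper names `reliability` and `time_at_reliability` below are introduced only for this sketch.

```r
beta <- 1.073   # shape reported by the fit
eta  <- 25.39   # scale reported by the fit

# Closed-form reliability (survival) of a two-parameter Weibull
reliability <- function(t) exp(-(t / eta)^beta)

# Inverse: time at which the reliability has dropped to r
time_at_reliability <- function(r) eta * (-log(r))^(1 / beta)

reliability(2)             # same quantity as pweibull(2, beta, scale = eta, lower.tail = FALSE)
time_at_reliability(0.75)  # same quantity as qweibull(0.75, beta, scale = eta, lower.tail = FALSE)
```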

Note that Genschel and Meeker demonstrated in this paper (2010) that, for most datasets, MLE was likely to produce more reliable estimates of Weibull parameters than MRR, and that this was consistent with evidence from several other independently published studies.

In our example, the framework selects MRR as the default fitting method, so remember to override this behaviour and select MLE via the 'method.fit' argument as shown in the code.

The mean or MTTF of the two-parameter Weibull distribution is $\eta\,\Gamma(1 + 1/\beta)$.

So the MTTF in our example:

    > mean <- 25.39*gamma(1/1.073 + 1)
    > mean
    [1] 24.70748
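As a sanity check, the MTTF is also the area under the survival function, so the closed form can be compared with a numerical integral. This is a small sketch using the fitted parameters quoted above; the variable names are illustrative.

```r
beta <- 1.073
eta  <- 25.39

# Closed form: eta * Gamma(1 + 1/beta)
mttf_closed <- eta * gamma(1 + 1 / beta)

# Numerical check: integrate the survival function from 0 to infinity
mttf_numeric <- integrate(
  function(t) pweibull(t, beta, scale = eta, lower.tail = FALSE),
  lower = 0, upper = Inf
)$value

c(closed = mttf_closed, numeric = mttf_numeric)
```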

Another way to calculate the mean is to fit the distribution with the fitdistr() function, which is provided by the MASS package:

    > install.packages('fitdistrplus')
    > library(fitdistrplus)
    > library(MASS)   # provides fitdistr()
    > xx.data <- read.csv('ds-wb.csv')$val
    > xx.fit <- fitdistr(xx.data, "weibull")
    > xx.coef <- coef(xx.fit)
    > h <- xx.coef["shape"]
    > s <- xx.coef["scale"]
    > l <- 0
    > mean <- l + s*gamma(1/h + 1)
    > unname(mean)
    [1] 24.70071

**Wrapping up**

This post comments on the survival function, the failure rate, some common analysis methods (FMEA and FTA) and the usual failure metrics (MTTR, MTTF and MTBF). They are all concepts required to understand and operate the R packages supporting survival analysis and reliability.

Throughout the entry, two simulated datasets were used to illustrate two different approaches: the non-parametric Kaplan-Meier product-limit estimator and the parametric two-parameter Weibull model.

The entry also comments on emergency repair versus preventive maintenance, the optimal maintenance rate and how all these aspects impact the fleet budget, cost, risk, etc.
