Mechanical Equipment Reliability Analysis: Case Study

In the civil and mining industries, wheel loaders are an important component of operations, and their effective operation is critical to cost performance. Environmental and operational factors dramatically affect the performance of loaders. In many cases, failure data are collected from multiple, distributed units working under different operational conditions, which can introduce heterogeneity into the data. Part of this heterogeneity can be explained and isolated by the observable covariates, whose values and whose effects on the item's reliability are known. However, some factors that may affect the item's reliability are typically unknown and lead to unobserved heterogeneity; these factors are categorized as unobserved covariates. In most reliability studies, the effect of unobserved covariates is neglected. This may lead to erroneous model selection for the time to failure of the item, as well as wrong conclusions and decisions. There is a lack of sufficient knowledge, theoretical background, and a systematic approach for modeling unobserved covariates in reliability analysis. This paper aims to present a framework for reliability analysis in the presence of unobserved and observed covariates. The unobserved covariates are analyzed using frailty models (such as the mixed proportional hazards model). A case study illustrates the application of the framework.


Introduction
Wheel loaders are widely used in mines, construction projects, and waste processing fields due to their merits of high mobility, remarkable operational flexibility, and relatively low capital cost. On construction sites, wheel loaders are used for digging shallow trenches, laying pipelines, and filling pits. These activities are tedious, and the weather conditions, which depend on the geographical location, are often harsh [1]. In surface mines, wheel loaders are mainly used for removing topsoil, loading fragmented rock and ore from the production bench, and cleaning operation sites. Loading and hauling constitute 35-46% of the daily operating cost of a surface mine [2]. Since there is a huge volume of rock and soil to be moved, the wheel loaders used in surface mines are very large and account for a considerable share of the capital investment. This dictates that the equipment must be used efficiently to reduce the unit cost of the product. Finding enough experienced operators for this type of complex equipment has also become a problem for mining companies. A wheel loader typically undertakes the job of loading fragmented rock from a pile into a truck or ore chute, completing the job in several cycles that each consist of three activities: loading, hauling or maneuvering, and dumping. Failures, however, interrupt this process, and management must find effective ways to deal with them. Reliability analysis is one of the most widely used approaches for this purpose: engineers and managers use past operating data of the system to estimate its future condition. In many reliability studies, data sets are assumed to be homogeneous, meaning the failure data are independent and identically distributed [3]-[5]. However, in practice, homogeneous failure data can scarcely be found. In reality, failure data are often collected from multiple items, distributed at different locations, and working under different operational conditions (e.g., operator skill, maintenance strategies, etc.) [6]. This may introduce heterogeneity into the data [7], [8]. Differences in failure intensity are generally called heterogeneities and can be due to either observed or unobserved risk factors, which are called covariates [9]-[11]. Covariates describe the item's characteristics or the environment in which it operates [12] and may have different levels. For example, vibration, as a covariate affecting the reliability of a pump, may be at a high, medium, or low level [13]. Observed covariates are factors whose effects on the failure process are known and whose associated levels are recorded with the failure data; they can be time-dependent or time-independent, where time-dependent observed covariates vary continuously with time. Unobserved covariates are covariates whose effects on the failure process are typically unknown, or whose associated levels during the operating time or at the time of failure are not available in the failure database [14]. Unobserved covariates may lead to unobserved heterogeneity [13], [15], [16]. For example, in a production process, some pumps may have a soft-foot problem due to a defect in the installation process. The soft-foot problem will put the bearing in an over-stressed situation and should therefore be considered a covariate for the reliability analysis. Suppose there is no information regarding the soft foot in the failure database of the bearing.
In that case, an unobserved covariate should be defined to capture the effect of the soft foot on the reliability of the bearing. In general, due to the quality of manufacturing, installation, operation, and maintenance procedures, some items may be frailer, while others are more robust. In the presence of unobserved covariates, different items may have different levels of frailty. Unobserved covariates are typically unknown or unavailable for each item; hence, they cannot be explicitly included in the analysis. Our literature review revealed that, in many cases, for reasons of practicality and simplicity, unobserved covariates are eliminated during failure data analysis [9], [11], [13], [14], [17]. However, if unobserved covariates are neglected, the reliability analysis only represents the reliability of items with an average level of frailty and not that of the individual items. In general, high-risk items (high frailty) tend to fail earlier than low-risk items (low frailty) for unobserved reasons; thus, the composition of the population changes over time. Hence, over time, the analysis comes to represent the items with low frailty, and the estimated reliability increases more with time than the reliability of a randomly selected item of the population [18], [19]. The Cox regression model family, such as the proportional hazards model (PHM) and its extensions, for example, the stratification approach, is the most dominant statistical approach for capturing the effect of covariates on the reliability performance of an item [6], [13], [17], [20]-[24]. In the PHM, the hazard rate of an item is the product of a baseline hazard rate and a positive functional term that describes how the hazard rate changes as a function of the covariates. However, the PHM is very sensitive to the omission of covariates and cannot isolate the effect of unobserved covariates [13]. In survival analysis in medical science, the frailty model introduced by Clayton [25] and Vaupel et al. [16] describes the influence of unobserved covariates in a proportional hazards model. A frailty model is a random effects model for time variables, where the random effect (the frailty) has a multiplicative effect on the hazard [15], [23]. A gamma, inverse Gaussian, or exponential distribution can be used to model the frailty [6], [14], [15], [26].
Recently, some studies in the reliability field have used the frailty model to capture the effect of missing covariates on the reliability of an item when making maintenance decisions [6], [9], [24], [27]. To get a glimpse of the status of applications of the frailty model in recent years, we conducted an online search of Scopus in December 2018 with the relevant keywords. We found that 33 papers with the keywords 'frailty' and 'reliability' have been published since 2008. Of these, only seven are related to applying a frailty model in reliability engineering with a focus on maintenance purposes. Asha [28] incorporated the frailty model into load-share systems and showed that reliability analysis for a heterogeneous case can differ dramatically from that for a homogeneous setting. Xu and Li [29] obtained stochastic properties of univariate frailty models, which are a special case of multivariate frailty models, and Misra [30] used stochastic orders to compare frailty models arising from different choices of frailty distribution. Asfaw and Lindqvist [23], Finkelstein [31], Slimacek and Lindqvist [32], and Giorgio [11] modeled the frailty through the introduction of a non-homogeneous Poisson process (NHPP). For example, Slimacek and Lindqvist [32] used frailty to model the effect of unobservable differences between turbines, as unobserved covariates, on the reliability of wind turbines using the Poisson process. It must be mentioned that most of the available studies do not discuss how time-dependent covariates should be handled in the frailty model; instead, ignoring the time dependency of covariates is the most common practice in these studies. Moreover, the statistical tests required to investigate observed and unobserved heterogeneity among the failure data are not discussed. It seems that simplistic assumptions, the lack of a systematic approach, and an inadequate theoretical background are the main barriers to the proper application of the frailty model in the reliability field. To overcome these challenges, the main contribution of this paper is to present a framework for failure data analysis in the presence of unobserved and observed covariates. The framework is based on the mixed proportional hazards model, originally developed by Lancaster [33].

Reliability analysis of wheel loaders
The Jajarm Bauxite Mine, located in Iran, consists of 19 main open-pit mines near the city of Jajarm. The deposit extends 16 kilometers from west to east (namely: Golbini 1-8, Zou 1-4, Tagouei 1-6, and Sangtarash). The length of each section is as follows: Golbini, a total of 4.7 km; the Zou mines, a total of 3.3 km; the Tagouei mines, a total of 5 km; and the Sangtarash mine, about 3 km. The Jajarm bauxite falls into the lens-like layer category, and the deposit occurs mostly in the form of layers. The minerals lie on the karstic dolomites of the Elika formation, which lies under the shales and sandstones of the Shemshak formation. The bauxite layer is neither uniform in thickness nor consistent in quality, generally ranging from less than 1 meter to about 40 meters in thickness. The case study focused on the failure data of two wheel loaders (n = 2) from the Kaj-Mahya Company, which were put into service in the Jajarm Bauxite Mine. The wheel loaders' major design characteristics (weight, size, maximum load capacity, etc.) were almost the same. The systematic framework for the analysis of reliability data in the presence of observed and unobserved covariates (observed and unobserved heterogeneity) is described in Figure 1. This methodology is based on four important steps: i) establishing the context and data collection, ii) identifying the baseline hazard rate based on the nature of maintenance, iii) modeling the effect of the covariates, and iv) parameter estimation and goodness-of-fit (GOF) testing.

Figure 1. A framework for reliability model selection for civil and mining wheel loaders [34], [35], [39], [41]

As this figure shows, the context should be established in the first step. In this step, all external and internal parameters to be considered when analyzing failure data, as well as the scope and assumptions of the reliability analysis, should be defined. The external context is the external environment in which the item is going to work, such as ambient temperature, pressure, humidity, etc. The internal context is the internal conditions related to the item itself and to the company running and maintaining the item, including the repair and physics of failure, operator condition, maintenance crew, etc. Understanding the external and internal context is important for identifying the observed covariates. For example, based on the physics of failure, road conditions can contribute to the failure of a truck in a mine; hence, they should be considered as a covariate in the reliability analysis of the truck. In this step, the possible relationships between different covariates should be investigated, as well as the possible levels of each of them.

Modeling the effect of the covariates
In the next step, the failure data and all possible observed covariates associated with each failure should be collected. After that, based on the nature of the failure data (e.g., the trend behavior of the data) and the type of repair strategy, the appropriate baseline hazard should be selected. For example, the common assumptions for a repairable system can be i) perfect repair or good-as-new condition, ii) minimal repair or bad-as-old condition, or iii) jumps in the hazard rate after repair, i.e., a different baseline hazard rate. Under the perfect repair strategy, the item is restored to a 'good-as-new' condition, and the main assumption is that the hazard rate is reset to that of a new system after maintenance. If the times between failures are independent and identically distributed (iid), it can be concluded that the item underwent perfect repair [23]. In such cases, a classical distribution, such as the Weibull distribution, can be used to model the baseline hazard rate.
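As a minimal illustration of this case (a sketch, not from the paper; the sample values below are hypothetical), a two-parameter Weibull can be fitted to iid times between failures and checked with a standard goodness-of-fit test:

```python
# A minimal sketch: fitting a Weibull baseline to times between failures
# under the perfect-repair (renewal) assumption. Data are placeholders.
import numpy as np
from scipy import stats

tbf = np.array([112.0, 80.5, 145.2, 60.3, 98.7, 130.1, 75.4])  # hypothetical times between failures (h)

# Fix the location at 0 so this is the usual two-parameter Weibull fit.
shape, loc, scale = stats.weibull_min.fit(tbf, floc=0)
print(f"shape (beta) = {shape:.3f}, scale (eta) = {scale:.1f} h")

# Kolmogorov-Smirnov goodness-of-fit check for the fitted distribution.
ks = stats.kstest(tbf, "weibull_min", args=(shape, loc, scale))
print(f"KS statistic = {ks.statistic:.3f}, p-value = {ks.pvalue:.3f}")
```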
In the case of minimal repair (bad-as-old), an item has the same intensity function after a repair as before the failure. The failure times under minimal repair can be modeled as a non-homogeneous Poisson process; in other words, the baseline hazard rate is modeled using an NHPP. However, it should be mentioned that, on some occasions, such as an overhaul, the system may return to a 'good-as-new' condition. Under this condition, it is assumed that the NHPP is cyclic: each cycle starts as a renewal, and within the cycle the failure times follow the NHPP. In this case, the failure data are categorized by these occasions (for example, overhauls). A stratification approach is then used to estimate the effect of each covariate, while the NHPP models the baseline hazard rate. Moreover, the baseline hazard rate may change when a fleet of items is analyzed after it has been in service for some time and has undergone several repairs. For example, in some cases, as the number of failures increases, the average time between failures decreases; hence, the baseline hazard rate will not be identical for a particular failure number. Here, the failure data can be categorized based on the failure number, which can be used to define strata, and the stratification approach can then be used to model the fleet failure data.
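Under the minimal repair assumption with a power law intensity u(t) = λβt^(β−1) (the power law process discussed below), the maximum-likelihood estimates for a single unit observed over [0, T] have a closed form. A minimal sketch, with hypothetical failure times (not the paper's data):

```python
# A minimal sketch: maximum-likelihood estimates for the power law
# process (NHPP with intensity u(t) = lam * beta * t**(beta - 1)),
# time-truncated at T. All values are hypothetical.
import numpy as np

t = np.array([35.0, 92.0, 170.0, 260.0, 410.0, 480.0])  # cumulative failure times (h)
T = 500.0                                               # observation window (h)

n = len(t)
beta_hat = n / np.sum(np.log(T / t))   # shape: beta > 1 implies deterioration
lam_hat = n / T**beta_hat              # scale parameter
print(f"beta = {beta_hat:.3f}, lambda = {lam_hat:.5f}")
print(f"intensity at T: {lam_hat * beta_hat * T**(beta_hat - 1):.4f} failures/h")
```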
In general, the first step in analyzing the collected failure data of a repairable system is to check the trend of the failure data. If the data show a trend, an NHPP, such as the power law process model or the trend renewal process (TRP), can be used to model the baseline hazard rate. When there is no trend in the data, a classical distribution, such as the Weibull distribution, can be used instead. In either case, goodness-of-fit tests, such as residual tests, should be used to find the best-fitting distribution for the failure data. For more information regarding trend tests, see [9], [42].
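One widely used trend test is the Laplace test; a minimal sketch (hypothetical data, not from the paper) is given below. Under the null hypothesis of a homogeneous Poisson process (no trend), the statistic is approximately standard normal.

```python
# A minimal sketch: the Laplace trend test for a time-truncated
# observation window [0, T]. Data values are hypothetical.
import numpy as np
from scipy import stats

t = np.array([35.0, 92.0, 170.0, 260.0, 410.0, 480.0])  # failure times (h)
T = 500.0

n = len(t)
U = (t.sum() - n * T / 2) / (T * np.sqrt(n / 12.0))
p = 2 * stats.norm.sf(abs(U))  # two-sided p-value
print(f"U = {U:.3f}, p = {p:.3f}")  # U > 0 suggests deterioration, U < 0 improvement
```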
In the next step, the time dependency of the observed covariates should be checked. The failure data then need to be investigated for unobserved covariates. Data sets without unobserved heterogeneity are analyzed using the classical proportional hazards model family, including the proportional hazards model (when all observed covariates are time-independent) and the extended proportional hazards model (in the presence of time-dependent covariates). Data sets with unobserved heterogeneity are analyzed using the mixed proportional hazards model (MPHM) family.
In the first step, we had to collect the failure data along with the related observed risk factors. To do so, we needed to identify the observed risk factors, as seen in Table 1. According to the table, we identified 11 observed risk factors with a possible effect on the reliability of the wheel loaders. The training processes were different in these companies, leading to different levels of operator skill; as a result, the wheel loaders would experience different levels of stress. Altogether, this can cause different failure rates for the wheel loaders of these companies. The numbers in square brackets in Table 1 were used to encode the levels of the risk factors; for example, the wheel loaders work in three shifts, namely, morning, afternoon, and night shifts, so we used 0, 1, and 2 to refer to these shifts, respectively. A sample of the data is shown in Table 1.

Heterogeneity test for unobserved covariates

Several statistical tests developed in the meta-analysis literature, such as Cochran's Q statistic, can be used to detect heterogeneity. Under the hypothesis of homogeneity among the effect sizes, the Q statistic follows a chi-square distribution with k − 1 degrees of freedom, k being the number of studies. However, these statistical tests and their applications are limited, mainly due to their requirements in terms of data and assumptions; each test is optimal for detecting heterogeneity of a specific form [46], [47]. For example, a shortcoming of the Q statistic is that it has poor power to detect true heterogeneity among studies when the meta-analysis includes a small number of studies and excessive power to detect negligible variability with a large number of studies. Recently, the I² index has been proposed to quantify the degree of heterogeneity in a meta-analysis [39]. A likelihood ratio test, the Akaike information criterion (AIC), and the Bayesian information criterion (BIC) are common tests for checking the hypothesis of the presence of heterogeneity against the null hypothesis of no heterogeneity (θ = 0). In general, the AIC performs well when heterogeneity is small, but if heterogeneity is large, the BIC will often perform better [9], [12], [49]. For example, in the case of a Weibull distribution for the baseline hazard rate, the likelihood ratio statistic can be written as (reconstructed here in standard notation):

R = 2\left[\ln L(\hat{\lambda}, \hat{\beta}, \hat{\gamma}, \hat{\theta}) - \ln L(\hat{\lambda}, \hat{\beta}, \hat{\gamma}, \theta = 0)\right]

Here, λ̂ and β̂ are the estimated parameters of the Weibull distribution, γ̂ is the vector of regression coefficients for the observed covariates, and θ̂ can be interpreted as the degree of heterogeneity [9]. These parameters can be estimated by maximizing the full likelihood function. Because θ = 0 lies on the boundary of the parameter space, R follows a 50:50 mixture of χ²₀ and χ²₁ distributions; hence, at the 5% significance level, the null hypothesis (no heterogeneity) is rejected if R ≥ 2.706. Moreover, under the minimal repair strategy, a power law can be used to represent the intensity function. Under the assumption of the power law intensity function, a three-step likelihood ratio test procedure can be performed to check whether a significant amount of heterogeneity among units exists [9]. In the first step, the null hypothesis of common parameters, H₀: λ₁ = λ₂ = ⋯ = λₙ = λ, β₁ = β₂ = ⋯ = βₙ = β, is tested against the alternative hypothesis H₁ that the λⱼ and βⱼ differ among the units. In the second and third steps, the cases of a common λ with unit-specific β, and unit-specific λ with a common β, are tested, respectively [11]. A minimal sketch of the first step is given below.
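As an illustration (not from the paper), the sketch below carries out the first step of this procedure for two hypothetical units with power law intensity u(t) = λβt^(β−1), both observed over [0, T]: it compares the maximized log-likelihood under common (λ, β) against unit-specific parameters via a likelihood ratio test. All data values and the helper name plp_profile_loglik are hypothetical.

```python
# A minimal sketch of the first step of the power-law heterogeneity test:
# H0 (common lambda, beta for all units) vs H1 (unit-specific parameters).
import numpy as np
from scipy import stats

units = {  # hypothetical cumulative failure times (h) per wheel loader
    "WL1": np.array([60.0, 150.0, 240.0, 310.0, 420.0]),
    "WL2": np.array([30.0, 70.0, 130.0, 210.0, 260.0, 330.0, 380.0, 450.0]),
}
T = 500.0  # common observation window (h), time-truncated

def plp_profile_loglik(groups, T):
    """Maximized PLP log-likelihood with one (lambda, beta) shared by all groups."""
    all_t = np.concatenate(groups)
    n, k = len(all_t), len(groups)
    beta = n / np.sum(np.log(T / all_t))  # shared shape MLE
    lam = n / (k * T**beta)               # shared scale MLE
    # log L = n*log(lam*beta) + (beta-1)*sum(log t) - k*lam*T**beta; the
    # last term equals -n at the MLE.
    return n * np.log(lam * beta) + (beta - 1) * np.sum(np.log(all_t)) - n

ll_common = plp_profile_loglik(list(units.values()), T)                 # H0
ll_separate = sum(plp_profile_loglik([t], T) for t in units.values())  # H1

LR = 2 * (ll_separate - ll_common)
df = 2 * (len(units) - 1)  # two extra free parameters per additional unit
print(f"LR = {LR:.2f}, p = {stats.chi2.sf(LR, df):.3f}")
```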
We employed the likelihood ratio test to assess the heterogeneity of the data. Based on our assumption, the baseline hazard rate was represented by the Weibull distribution, and the frailty was represented by a gamma distribution with mean 1 and variance θ, whose density is (in standard notation):

g(\alpha) = \frac{\alpha^{1/\theta - 1}\, e^{-\alpha/\theta}}{\Gamma(1/\theta)\, \theta^{1/\theta}}

The hazard rate of wheel loader j can then be written as:

h_j(t; \mathbf{z}_j, \alpha_j) = \alpha_j\, \lambda \beta t^{\beta - 1} \exp(\gamma \mathbf{z}_j)

We then carried out the likelihood ratio test described above. The p-value for R = 33.27 was 0.000, confirming the effect of unobserved risk factors (unobserved heterogeneity) on the reliability of the wheel loaders. Hence, as in Figure 1, the mixed proportional hazards model (MPHM) or the extended mixed proportional hazards model (EMPHM) should be used to analyze the data.
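For the Weibull baseline with gamma frailty used here, a minimal sketch of this likelihood ratio test could look as follows. This is not the authors' code: it assumes uncensored failure times and a single binary covariate, the data are synthetic placeholders, and the function names are hypothetical.

```python
# A minimal sketch: LR test for gamma frailty (mean 1, variance theta)
# with a Weibull baseline, for uncensored times and one covariate z.
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(1)
t = rng.weibull(1.3, 100) * 150.0          # hypothetical failure times (h)
z = rng.integers(0, 2, 100).astype(float)  # hypothetical binary covariate

def nll_weibull_ph(p):
    """Negative log-likelihood, Weibull PH without frailty (theta = 0)."""
    lam, beta, gamma = np.exp(p[0]), np.exp(p[1]), p[2]
    cum = lam * t**beta * np.exp(gamma * z)  # cumulative hazard
    logh = np.log(lam * beta) + (beta - 1) * np.log(t) + gamma * z
    return -np.sum(logh - cum)

def nll_gamma_frailty(p):
    """Negative log-likelihood with the frailty integrated out."""
    lam, beta, gamma, theta = np.exp(p[0]), np.exp(p[1]), p[2], np.exp(p[3])
    cum = lam * t**beta * np.exp(gamma * z)
    logh = np.log(lam * beta) + (beta - 1) * np.log(t) + gamma * z
    # marginal density: f(t) = h0 * e^{gz} * (1 + theta*cum)^(-1/theta - 1)
    return -np.sum(logh - (1.0 / theta + 1.0) * np.log1p(theta * cum))

fit0 = optimize.minimize(nll_weibull_ph, x0=[-5.0, 0.0, 0.0], method="Nelder-Mead")
fit1 = optimize.minimize(nll_gamma_frailty, x0=[-5.0, 0.0, 0.0, -1.0], method="Nelder-Mead")

R = 2 * (fit0.fun - fit1.fun)
# theta = 0 is on the boundary, so R ~ a 50:50 mixture of chi2(0) and chi2(1)
p_value = 0.5 * stats.chi2.sf(max(R, 0.0), df=1)
print(f"R = {R:.2f}, p = {p_value:.4f}  (reject homogeneity if R >= 2.706)")
```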

Time dependency test of observed covariates
There are two general approaches for checking the time dependency of covariates: i) graphical procedures and ii) goodness-of-fit testing procedures [17]. The graphical procedures developed so far can generally be categorized into three main groups: i) cumulative hazards plots, ii) average hazards plots, and iii) residual plots [13]. For example, in cumulative hazards plots, the data are categorized based on the different levels of the covariate that is to be checked for time dependency.
Consider a covariate z_i that can be categorized into r levels. For the category in which the covariate is equal to z_r, the hazard rate can be written as (reconstructed here in standard notation):

h(t; \mathbf{z}) = h_0(t) \exp(\gamma_i z_r + \boldsymbol{\gamma}^{*} \mathbf{z}^{*})

where \mathbf{z}^{*} is the same covariate vector as before, with z_i omitted, i = 1, 2, …, p − 1 and j ≠ r. If the PH assumption is justified, then we end up with:

\ln H(t; z_j) = \ln H(t; z_r) + \gamma_i (z_j - z_r)

A similar relation can be concluded for the cumulative baseline hazard rate. Hence, if the PH assumption is justified, the plots of the logarithm of the estimated cumulative baseline hazard rates against time for the defined categories should simply be shifted by an additive constant, γ_i(z_j − z_r). In other words, they should be approximately parallel and separated, corresponding to the different values of the covariate. Departure from parallelism of these plots for the different categories may suggest that z_i is a time-dependent covariate. For a review of other graphical approaches, see [13], [17], [50]-[52].
In the same way as for the cumulative baseline hazard rate, a log-log Kaplan-Meier curve over different (combinations of) categories of variables can be used to check the PH assumption. A log-log reliability curve is simply a transformation of an estimated reliability curve that results from taking the natural log of the estimated survival probability twice. If we use a PHM or MPHM and plot the estimated log-log reliability curves for the defined categories on the same graph, the plots should be approximately parallel [13]. For the residuals plot, in the first step the residuals are estimated using the estimated cumulative hazard rate and the estimated regression vector as (the Cox-Snell residual, reconstructed in standard notation):

r_j = \hat{H}_0(t_j) \exp(\hat{\gamma} \mathbf{z}_j)

If the PH assumption is justified, then the logarithm of the estimated reliability function of the residuals, plotted against the residuals, should lie approximately on a straight line with slope −1 [13], [53]. A transformed plot of the partial residuals suggested by Schoenfeld can also be used as an exploratory tool to detect time-varying effects of a covariate, even when the a priori form of the time dependence is unknown [54]-[56]. The Schoenfeld residuals test is analogous to testing whether the slope of the scaled residuals on time is zero; if the slope is not zero, the proportional hazards assumption has been violated [47]. When the covariates are quantitative, graphical approaches are challenging to use, as it is difficult to define levels for quantitative covariates and to decide whether the plots are parallel. In such cases, it is better to use a goodness-of-fit testing procedure such as the chi-squared goodness-of-fit test [5], [57], [58], the log-rank test [5], [57], the likelihood ratio test [5], [57], score tests [57], [59], the doubly cumulative hazard function [60], the Wilcoxon test [61], or generalized moments specification tests [62]. For example, if the PH assumption is justified, different two-sample tests, e.g., the generalized Wilcoxon and log-rank tests, should give the same results [13].
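As an illustration of the Schoenfeld-residual check, a sketch assuming the Python lifelines package is given below; the data frame, column names, and values are hypothetical, and the paper itself used other software.

```python
# A minimal sketch of a Schoenfeld-residual PH check with lifelines.
import pandas as pd
from lifelines import CoxPHFitter
from lifelines.statistics import proportional_hazard_test

df = pd.DataFrame({
    "time":  [112.0, 80.5, 145.2, 60.3, 98.7, 130.1, 75.4, 155.0],
    "event": [1, 1, 0, 1, 1, 1, 0, 1],  # 1 = failure, 0 = censored
    "shift": [0, 1, 2, 0, 1, 2, 0, 1],  # working shift (0/1/2)
})

cph = CoxPHFitter().fit(df, duration_col="time", event_col="event")

# Tests whether the scaled Schoenfeld residuals have zero slope over time;
# a small p-value flags a violation of the PH (time-independence) assumption.
result = proportional_hazard_test(cph, df, time_transform="rank")
result.print_summary()
```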
As seen in Table 2, we had to examine the time dependency of the risk factors following the collection of the data and the observed risk factors. We applied the graphical approach (a ln-ln reliability curve) to evaluate the time dependency of all the risk factors. Figure 2 shows −ln(−ln(reliability)) against ln(analysis time) for two of the observed risk factors, Rock kind and System ID. In both graphs the curves are approximately parallel; thus, the proportionality assumption applies to the data sets, and these risk factors are concluded to be time-independent. The −ln(−ln(reliability)) plots for the other risk factors confirmed the same result. Hence, as in Figure 1, the MPHM should be used to analyze the data. Software such as STATA, R, SAS, and SYSTAT can be used to estimate the MPHM parameters.
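A minimal sketch of such a ln-ln reliability plot, assuming lifelines' KaplanMeierFitter and its plot_loglogs helper (hypothetical data and column names; the paper's Figure 2 was produced with other software):

```python
# A minimal sketch: log-log reliability curves per covariate level;
# roughly parallel curves support time-independence (PH).
import matplotlib.pyplot as plt
import pandas as pd
from lifelines import KaplanMeierFitter

df = pd.DataFrame({
    "time":  [112.0, 80.5, 145.2, 60.3, 98.7, 130.1, 75.4, 155.0],
    "event": [1, 1, 0, 1, 1, 1, 0, 1],
    "rock":  [0, 0, 0, 0, 1, 1, 1, 1],  # hypothetical rock-kind levels
})

ax = plt.subplot(111)
for level, sub in df.groupby("rock"):
    kmf = KaplanMeierFitter(label=f"rock kind = {level}")
    kmf.fit(sub["time"], event_observed=sub["event"])
    kmf.plot_loglogs(ax=ax)  # plots ln(-ln S(t)) against ln(t)
plt.show()
```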

Parameter estimation
In the MPHM, the hazard rate of an item is the product of a baseline hazard rate and two positive functions: i) an observed covariate function and ii) an unobserved covariate function (the frailty). Suppose we have a fleet of items indexed by j; the hazard function for item j at time t > 0 is (reconstructed here in standard notation):

h_j(t; \mathbf{z}_j, \alpha_j) = \alpha_j\, h_0(t) \exp(\boldsymbol{\gamma} \mathbf{z}_j)

where h_0(t) is an arbitrary baseline hazard rate dependent on time alone, \mathbf{z}_j is a row vector of the observed covariates associated with the item, \boldsymbol{\gamma} is a column vector of the regression parameters for the identified observed covariates, and α_j is a time-independent frailty for item j that represents the cumulative effect of one or more unobserved covariates. In general, the baseline hazard rate h_0(t) may either be left unspecified or be modeled using a specific parametric form such as a Weibull distribution or an NHPP. According to the MPHM, the fleet of items (the population) is represented as a mixture in which h_0(t) and the frailty distribution g(α; θ) are common to all items, although each item has its own frailty. The observed and unobserved covariates can affect the hazard rate so that the actual hazard rate h_j(t; z_j, α_j) is either greater (e.g., in the case of a higher vibration level or poor maintenance) or smaller (e.g., better training for operators, installation of a new ventilation system) than the baseline hazard rate. Moreover, items with α_j > 1 are said to be frailer for reasons left unexplained by the observed covariates and will have an increased risk of failure. Items with α_j < 1 are less frail; hence, given a certain observed covariate pattern, they tend to be more reliable.
For the MPHM, given the relationship between the hazard rate and the reliability function, it can be shown that the conditional (item) reliability function, conditional on the frailty α, is [14] (equations reconstructed here in standard notation):

R(t; \mathbf{z}, \boldsymbol{\gamma} \mid \alpha) = \exp\!\left(-\alpha \int_0^t h_0(u) \exp(\boldsymbol{\gamma} \mathbf{z})\, du\right) = \exp\!\left(-\alpha H(t; \mathbf{z}, \boldsymbol{\gamma})\right)

The unconditional (population) reliability function can then be estimated by integrating out the unobserved α. If α has probability density function g(α), then the population or unconditional reliability function is given by:

R_\theta(t; \mathbf{z}, \boldsymbol{\gamma}) = \int_0^\infty R(t; \mathbf{z}, \boldsymbol{\gamma} \mid \alpha)\, g(\alpha)\, d\alpha \quad (8)

where the subscript θ emphasizes the dependence on the frailty variance θ. The relationship between the reliability function and the hazard function still holds unconditionally on α, and thus we can obtain the population hazard function using [14]:

h_\theta(t; \mathbf{z}, \boldsymbol{\gamma}) = -\frac{d}{dt} \ln R_\theta(t; \mathbf{z}, \boldsymbol{\gamma})

For the gamma-distributed frailty, integrating out the unobserved covariates gives [14]:

R_\theta(t; \mathbf{z}, \boldsymbol{\gamma}) = \left[1 + \theta H(t; \mathbf{z}, \boldsymbol{\gamma})\right]^{-1/\theta}, \qquad f_\theta(t) = h_\theta(t)\, R_\theta(t)

where f_θ is the corresponding probability density function. In a shared frailty model, suppose we have data for i = 1, …, n groups, with j = 1, …, n_i observations per group, consisting of the trivariate response (t_0ij, t_ij, d_ij), which indicates the start time, end time, and failure/censoring indicator for the jth item from the ith group. When the shared frailties follow a gamma distribution, the unconditional group likelihood L_i can be expressed compactly as [14]:

L_i = \frac{\Gamma(1/\theta + D_i)}{\Gamma(1/\theta)}\, \theta^{D_i} \left[1 + \theta \sum_{j=1}^{n_i} \left(H(t_{ij}) - H(t_{0ij})\right)\right]^{-(1/\theta + D_i)} \prod_{j=1}^{n_i} h(t_{ij})^{d_{ij}}

where D_i = \sum_{j=1}^{n_i} d_{ij}. Given the unconditional group likelihoods, we can estimate the regression parameters and the frailty variance θ by maximizing the overall log-likelihood ln L = Σ_i ln L_i. In shared-frailty Cox models, the estimation consists of two steps. In the first step, the optimization is in terms of θ alone. For fixed θ, the second step consists of fitting a standard Cox model via a penalized log-likelihood, with the log-frailties ν_i introduced as estimable coefficients of dummy variables identifying the groups. The same approach can be used to estimate the likelihood functions for the EMPHM, MPHM, and PHM. For more information, see [9], [13], [14].

Table 3 and Table 4 report the estimated parameters of the fitted models, and the accompanying figures show the conditional and unconditional (population) reliability and hazard of the wheel loaders. Neglecting the unobserved covariates, i.e., selecting the PHM, introduces a possible bias in the estimated covariate effects. Based on the estimated coefficients, the PHM, compared with the MPHM, underestimates the effect of Company, Working shift, Rock fragmentation, and Slope of the road and overestimates the effect of Proportionality of wheel loaders and Number of services. Moreover, under the assumption of the PHM, the Rock condition significantly affects the hazard rate of the wheel loaders. In contrast, its effect is insignificant when the MPHM is used to model the failure data. Hence, for any decision on the operation and maintenance strategy, the effect of unobserved covariates should be considered. Finally, although we have modeled the effect of unobserved covariates on the reliability of the wheel loaders, to enhance the accuracy of the estimation model in future work, we need to explore the unobserved covariates further.
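To make the effect of the frailty concrete, the short sketch below (hypothetical parameter values, not the authors' estimates) evaluates the conditional reliability for a frail and a robust item and the unconditional population reliability of equation (8) under the gamma frailty with a Weibull baseline:

```python
# A minimal sketch: conditional vs. unconditional (population)
# reliability under a gamma frailty with a Weibull baseline.
import numpy as np

lam, beta, gamma, z, theta = 2e-4, 1.4, 0.35, 1.0, 0.8  # hypothetical values
t = np.linspace(1.0, 1000.0, 5)

H = lam * t**beta * np.exp(gamma * z)  # cumulative hazard H(t; z, gamma)

R_frail  = np.exp(-1.5 * H)            # conditional, frail item  (alpha = 1.5)
R_robust = np.exp(-0.5 * H)            # conditional, robust item (alpha = 0.5)
R_pop = (1.0 + theta * H) ** (-1.0 / theta)  # unconditional, eq. (8), gamma frailty

for row in zip(t, R_frail, R_robust, R_pop):
    print("t={:6.0f}  R|a=1.5: {:.3f}  R|a=0.5: {:.3f}  R_pop: {:.3f}".format(*row))
# As t grows, R_pop exceeds exp(-H): frail items fail first, so the
# surviving population is increasingly dominated by robust items.
```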