Masked Data Analysis based on the Generalized Linear Model

In this paper, we consider the estimation problem in the presence of masked data for series systems. A missing indicator is proposed to describe the masked set of each failure time. Moreover, a generalized linear model (GLM) with an appropriate link function is used to model the masked indicator so that the masking information enters the likelihood function. Both maximum likelihood and Bayesian methods are considered. The likelihood functions under both the missing at random (MAR) and missing not at random (MNAR) mechanisms are derived. Using an auxiliary variable, a Bayesian approach is developed to obtain posterior estimates of the model parameters. The proposed methods are illustrated through a real example.


Introduction *
In a series system, the failure time and the exact component that causes system failure are important and can be used to estimate the reliability of the components and of the system. However, in many cases, for reasons such as lack of diagnostic equipment or cost and time constraints, the exact component that causes the system to fail is not known, and we only know that it belongs to a smaller set of components. Data with this feature are called masked data [1,2,3].
The problem of maximum likelihood estimation (MLE) in the presence of masked data has been considered by authors such as Miyakawa [1], Usher and Hodgson [4] and Lin et al. [5], while Reiser [6], Berger and Sun [7], Mukhopadhyay and Basu [8] and Cai et al. [9] studied Bayesian statistical inference under masked data. Sen et al. [10] provided comprehensive details of the statistical analysis of system failure data under competing risks with possibly masked failure causes. Basu et al. [11] developed a Bayesian analysis for masked competing risks data from engineering systems and presented a general parametric framework for any number of competing risks and any distribution. The works mentioned above were carried out under the equiprobable assumption, that is, the masking probabilities do not depend on the cause of failure (called the symmetry assumption by some authors). However, many authors have not imposed this assumption; some of them are as follows. Lin and Guess [12] considered reliability estimation when the masking probability is related to the particular cause of failure. Guttman et al. [13] developed a Bayesian method to estimate component reliabilities from masked system lifetime data when the masking probability is related to the true cause of system failure. Kuo and Yang [14] considered different probability models for the conditional masking probabilities, along with exponential and Weibull distributions for the component lifetimes. Mukhopadhyay and Basu [15] developed a Bayesian analysis for s-independent exponentials without the symmetry assumption, using s-independent priors for the component failure rates and masking probabilities. Craiu and Duchesne [16] considered the maximum likelihood estimation of the cause-specific hazard functions and the masking probabilities via an EM algorithm.
* Corresponding Author Email: hmisaii@ut.ac.ir
Mukhopadhyay [17] developed the maximum likelihood method to estimate the lifetime parameters and masking probabilities via an EM algorithm, constructed approximate confidence intervals, and also presented bootstrap confidence intervals. Xu and Tang [18] considered a Bayesian analysis for series systems with two components with Pareto-distributed lifetimes, where the masking probabilities are independent of time. Xu et al. [19] presented a Bayesian approach for masked data in step-stress accelerated life testing and considered the log-location-scale distribution family in their study.
There is another type of incomplete data called missing data. Missing data occur when no value is stored for a variable in an observation, and they arise from different mechanisms depending on the reason for missingness. If missingness depends only on observed values, the missing mechanism is called missing at random (MAR), while if missingness depends on both observed and missing values, the missing mechanism is called missing not at random (MNAR) (Little & Rubin, 2002).
In this work, both classical and Bayesian statistical inference in the presence of masked data are studied. The novelty of the work lies in the definition of a missing indicator for the masking set of each observed failure time: if the masked set is a singleton set, the missing indicator takes the value one; otherwise it takes the value zero. Then, a generalized linear model (GLM) with an appropriate link function is used to model the missing indicator, and it is incorporated into the likelihood function. This method allows masked data to be analysed in a new manner which is more flexible than the existing approaches, especially when a Bayesian method is desired.
The rest of the paper is organized as follows. In Section 2, the model assumptions are introduced and the general formulation of the likelihood function is given. In Section 3, the auxiliary variables are introduced and the Bayesian analysis is discussed. In Section 4, the proposed methodology is illustrated by a numerical example. Finally, a conclusion is given in Section 5.

Model Assumptions and Likelihood Function
Assumptions
Suppose that we have $r$ series systems under test, each with the same $J$ components. Assume that at the end of the test we observe the failure times $t_1, t_2, \ldots, t_r$, but the exact cause of failure might be unknown, and we only know that it belongs to the Minimum Random Subset (MRS) of $\{1, 2, \ldots, J\}$. Let $M_i$ be the observed MRS corresponding to the failure time $t_i$, $i = 1, 2, \ldots, r$, of the $i$th system. The set $M_i$ includes the components that are possible causes of the system failure. If $M_i$ is a singleton set, then the data are competing risks data, while if $M_i = \{1, 2, \ldots, J\}$ the system is said to be completely masked. We define the binary variable $R_i$, which takes the value 1 when $M_i$ is a singleton set and the value 0 for masked data (when $M_i$ has more than one element). Thus, the observed data are
$$(t_i, M_i, R_i), \quad i = 1, 2, \ldots, r. \tag{1}$$
The model used in this paper is based on the following assumptions:
- Let $T_1, T_2, \ldots, T_J$ be the lifetimes of the independent components, and assume that the system fails due to exactly one of the components; therefore the system failure time is $T = \min(T_1, T_2, \ldots, T_J)$.
- $T_l$, the failure time of the $l$th component, follows a distribution in a continuous distribution family with density and reliability functions denoted by $f_l(t)$ and $\bar{F}_l(t)$.
- $P(M = M_i \mid T = t_i, K_i = l)$ is called the masking probability, where $K_i$ denotes the exact cause of failure of the $i$th system. In this article, we assume $P(M = M_i \mid T = t_i, K_i = l) = P(M = M_i \mid K_i = l) = p_l(M_i)$, that is, the masking probability is independent of the failure time but depends on the cause of failure.
- The $p_l(M_i)$ are subject to some constraints. Let $\mathcal{M}$ be the collection of all nonempty subsets of $\{1, \ldots, J\}$, which has $2^J - 1$ members.
- Let $T$ be the system failure time; the reliability function is given by
$$\bar{F}_T(t; \theta) = \prod_{l=1}^{J} \bar{F}_l(t; \theta_l),$$
where $\theta = (\theta_1, \ldots, \theta_J)$ and $\theta_l$ is the set of parameters related to component $l$.
- Let $K$ be a random variable which indicates the failure cause. Then the joint probability distribution function of $(T, K)$ is given by
$$f_{T,K}(t, l; \theta) = f_l(t; \theta_l) \prod_{j \neq l} \bar{F}_j(t; \theta_j).$$
- $R_i$ is a Bernoulli variable whose success probability is related to a linear predictor with coefficients $\beta_0$ and $\beta_1$ through $g$, where $g$ is some appropriate link function (e.g. logit, probit, cloglog, ...). When $\beta_1 = 0$ the missingness is ignorable and the missing mechanism is MAR.
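As an illustration of the data structure described by these assumptions, the following is a minimal simulation sketch, not the authors' procedure. It assumes exponential component lifetimes, a logit link, and a hypothetical linear predictor `beta0 + beta1 * 1{cause == 1}`; it also assumes that every masked system is completely masked. The function name `simulate_masked_series` and all argument names are illustrative only.

```python
import numpy as np

def simulate_masked_series(r, rates, beta0, beta1, rng):
    """Simulate (t_i, M_i, R_i) for r series systems (illustrative assumptions)."""
    rates = np.asarray(rates, dtype=float)
    J = rates.size
    # Independent exponential component lifetimes T_1..T_J (assumed family).
    lifetimes = rng.exponential(1.0 / rates, size=(r, J))
    t = lifetimes.min(axis=1)             # system failure time T = min(T_1..T_J)
    k = lifetimes.argmin(axis=1) + 1      # exact cause of failure, 1-based
    # Hypothetical linear predictor: the chance of observing the cause
    # depends on whether the cause is component 1 (MNAR when beta1 != 0).
    eta = beta0 + beta1 * (k == 1)
    prob_known = 1.0 / (1.0 + np.exp(-eta))   # logit link
    R = rng.binomial(1, prob_known)           # 1 = singleton set, 0 = masked
    full_set = tuple(range(1, J + 1))
    # Assumption: masked systems are completely masked, M_i = {1..J}.
    M = [(int(ki),) if ri == 1 else full_set for ki, ri in zip(k, R)]
    return t, M, R, k

rng = np.random.default_rng(1)
t, M, R, k = simulate_masked_series(5, [1.0, 2.0], beta0=-0.5, beta1=1.0, rng=rng)
print(list(zip(np.round(t, 3), M, R)))
```

Setting `beta1=0.0` in this sketch makes the masking indicator independent of the cause, which corresponds to the MAR case described above.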

Likelihood Function
The likelihood function for data (1) can be written as follows: where $\beta = (\beta_0, \beta_1)$, and $\theta$ is the vector of parameters related to the lifetime distributions.
For simplicity, let $\mathcal{I}_{\text{mask}} = \{1 \le i \le r ;\ R_i = 0\}$ denote the set of indices for masked data. Therefore, the complete likelihood function for data (1) is rewritten as follows:

If the missing mechanism is at random ($\beta_1 = 0$), then the above likelihood is reduced to: where the part related to the $R_i$'s can be ignored and simple masked data analysis can be used.
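To make the MAR case concrete, the sketch below evaluates a masked-data log-likelihood in the standard form implied by the assumptions above, $\sum_i \log \sum_{l \in M_i} f_l(t_i) \prod_{j \neq l} \bar{F}_j(t_i)\, p_l(M_i)$, with the terms for $R_i$ dropped. It is not a transcription of the paper's equations. Exponential lifetimes with rates `alpha` are an assumption made for brevity, and the function name and the dictionary layout for `p` are hypothetical.

```python
import math

def masked_loglik_exponential(t, M, alpha, p):
    """Masked-data log-likelihood under MAR with exponential components.

    t     : failure times t_i
    M     : masked sets M_i as sorted tuples of 1-based component labels
    alpha : exponential rates, alpha[l - 1] for component l (assumed family)
    p     : dict mapping (l, M_i) to the masking probability p_l(M_i)
    """
    total_rate = sum(alpha)
    loglik = 0.0
    for ti, Mi in zip(t, M):
        # f_l(t) * prod_{j != l} Fbar_j(t) = alpha_l * exp(-t * sum_j alpha_j),
        # so the exponential factor is common to every candidate cause l.
        weight = sum(alpha[l - 1] * p[(l, Mi)] for l in Mi)
        loglik += math.log(weight) - total_rate * ti
    return loglik

alpha = (0.5, 1.0)
p = {(1, (1,)): 0.8, (1, (1, 2)): 0.2,
     (2, (2,)): 0.6, (2, (1, 2)): 0.4}
print(masked_loglik_exponential([2.0, 1.0], [(1,), (1, 2)], alpha, p))
```

A function of this kind can be passed to a general-purpose optimizer to obtain MLEs; under MNAR the Bernoulli terms for $R_i$ would have to be added back.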

Bayesian Analysis
Here, we define an auxiliary variable to simplify the Bayesian likelihood function. Consider $I_{ij} = I(T_j = t_i)$, $1 \le i \le r$ and $1 \le j \le J$, where $I(\cdot)$ is the indicator variable that shows the exact cause of failure. $I_{ij} = 1$ means that the $i$th system has failed due to component $j$, where $j \in M_i$. Note that if $M_i = \{j\}$ is a singleton set, that is, the failure cause is known, then $I_{ij} = 1$ and $I_{ij'} = 0$ for $j' \neq j$. Therefore, the likelihood functions (5), (6) and (7) can be rewritten as (8), (9) and (10), respectively:
Because of conjugacy, a suitable prior for $p_j$ is the Dirichlet distribution, $D(\gamma_j)$, where $\gamma_j$ is a $2^{J-1}$-dimensional vector. The choice of prior distributions for the other parameters will depend on the CDF that is considered for $T_l$.
If $\pi(\theta)$, $\pi(\beta)$ and $\pi(p_l)$ are the priors for the parameters $\theta$, $\beta$ and $p_l$, respectively, then the joint density function of $(t, M, I, R)$ is obtained as The full conditional posterior distribution of $p_l$ is also a Dirichlet distribution, but its parameters depend on the observations. The full conditional distribution
As a special case, the likelihood function for the exponential distribution with parameter $\alpha_l$ for the $l$th component based on (8) is obtained as follows: And the likelihood function for the Weibull distribution with parameters $(\mu_l, \beta_l)$ for the $l$th component based on (8) is as follows:
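The Dirichlet full conditional for the masking probabilities mentioned above can be sketched as a conjugate update: given the (imputed) causes, the Dirichlet parameters for component $l$ are the prior parameters plus the counts of each masked set among the systems attributed to $l$. This is a minimal sketch under that reading, not the paper's exact expression. The helper names `sets_containing` and `dirichlet_posterior_params`, and the ordering of the $2^{J-1}$ sets, are illustrative assumptions.

```python
import numpy as np
from itertools import combinations

def sets_containing(l, J):
    """All subsets of {1..J} that contain l; there are 2**(J-1) of them."""
    others = [j for j in range(1, J + 1) if j != l]
    out = []
    for size in range(len(others) + 1):
        for combo in combinations(others, size):
            out.append(tuple(sorted((l,) + combo)))
    return out

def dirichlet_posterior_params(l, J, M, cause, gamma):
    """Prior parameters gamma plus masked-set counts for systems with cause l."""
    support = sets_containing(l, J)
    counts = np.zeros(len(support))
    for Mi, ki in zip(M, cause):
        if ki == l:
            counts[support.index(tuple(sorted(Mi)))] += 1
    return support, np.asarray(gamma, dtype=float) + counts

# Toy data: two components, imputed causes for four systems.
M = [(1,), (1, 2), (2,), (1, 2)]
cause = [1, 1, 2, 2]
support, params = dirichlet_posterior_params(1, 2, M, cause, gamma=[1.0, 1.0])
rng = np.random.default_rng(0)
p1_draw = rng.dirichlet(params)   # one Gibbs draw of p_1 over `support`
print(support, params, p1_draw)
```

Within a Gibbs sampler, this draw would alternate with updates of the latent indicators and of the lifetime and GLM parameters.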

Numerical Example
In this section, we illustrate the application of the proposed methods using two simulated data sets and a real data set.
Using 10,000 iterations of Gibbs sampling with a burn-in of 2,000 iterations and a thinning interval of 5, the posterior estimates of the parameters based on (15) and the resulting 1,600 posterior samples are listed in Table 3. In Table 4, Mean refers to the average posterior estimates of the model parameters and SRMSE refers to the square root of the mean squared error. As expected, the SRMSE becomes larger as the masking probability increases.
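The sample count and the summary measures quoted above can be checked with a short sketch. The function names are illustrative, and the SRMSE here follows the definition given in the text (square root of the mean squared error over replications).

```python
import numpy as np

def retained_draws(n_iter, burn_in, thin):
    """Number of draws kept after discarding burn-in and keeping every thin-th."""
    return len(range(burn_in, n_iter, thin))

def srmse(estimates, true_value):
    """Square root of the mean squared error of replicated estimates."""
    e = np.asarray(estimates, dtype=float)
    return float(np.sqrt(np.mean((e - true_value) ** 2)))

def bias(estimates, true_value):
    """Average estimate minus the true parameter value."""
    return float(np.mean(estimates) - true_value)

print(retained_draws(10_000, 2_000, 5))  # (10000 - 2000) / 5 = 1600
```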
The true values of the parameters and the corresponding bias of $p_1, p_2, \mu_1, \mu_2$ based on 1000 iterations are also given in Table 5. According to the results, the MNAR model has less bias than the usual MAR model.
The simulation was repeated 200 times in order to reduce the effect of randomness. Different cases of masking probabilities were considered, such as $(p_1, p_2) = (0.3, 0.5)$, $(0.7, 0.3)$ and $(0.8, 0.8)$. Based on the results in Table 7, the SRMSE becomes larger as the masking probability increases.
To motivate our study, we consider the real dataset given in Levuliene [8], from a test that recorded bus tire failure times (T) and the corresponding cause of failure (V).
In these data, we ignored soft failures and randomly masked $100 p_1$ percent and $100 p_2$ percent of those that failed due to the first and second competing risks, with $p_1 = 0.1$ and $p_2 = 0.2$. A Weibull distribution was fitted to these data. For the implementation of the logit model we considered $\beta_0 = -0.5$ and $\beta_1 = 1$, and the MLEs of the parameters based on (14) are presented in Table 8. Using the non-informative priors below, posterior estimates are presented in Table 9.
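The random masking step described above can be sketched as follows. This is a minimal illustration in which each failure is masked independently with a probability that depends on its cause; whether the authors masked an exact fraction or used independent draws is not stated, so the independent-draw scheme is an assumption. The function name `mask_by_cause` and the toy cause vector are hypothetical and do not reproduce the bus tire data.

```python
import numpy as np

def mask_by_cause(causes, mask_prob, rng):
    """Return R_i: 1 if the cause stays observed, 0 if it is masked.

    causes    : 1-based cause of failure for each system
    mask_prob : dict mapping cause -> probability of masking that cause
    """
    causes = np.asarray(causes)
    probs = np.array([mask_prob[int(c)] for c in causes])
    masked = rng.random(causes.size) < probs   # independent draws (assumption)
    return (~masked).astype(int)

rng = np.random.default_rng(42)
toy_causes = [1, 2, 1, 1, 2, 2, 1, 2]          # placeholder, not the real data
R = mask_by_cause(toy_causes, {1: 0.1, 2: 0.2}, rng)
print(R)
```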

Conclusion
In this paper, we have introduced a new approach for handling masked data. We proposed a generalized linear model to describe the relationship between the masking probability and the exact cause of failure using a binary variable. The simulation results show that the proposed method provides good estimates of the model parameters under both the maximum likelihood and Bayesian methods.