An overview of reinforcement learning and deep reinforcement learning for condition-based maintenance

Condition-based maintenance (CBM) involves making maintenance decisions based on the actual deterioration condition of the components. It consists of a chain of states representing various stages of deterioration and a set of maintenance actions. Therefore, condition-based maintenance is a sequential decision-making problem. Reinforcement Learning (RL) is a subfield of Machine Learning proposed for automated decision-making. This article provides an overview of the reinforcement learning and deep reinforcement learning methods that have been used so far in condition-based maintenance optimization.


Introduction
Industrial systems are in general subject to degradation because of usage and exposure to environmental factors. This degradation eventually leads to system failure, resulting in safety issues, equipment damage, quality issues, and unexpected machine unavailability [1]. A few decades ago, maintenance was mostly considered something that had to be done after such a failure, and it was difficult to manage. Today, maintenance is widely recognized as an essential business function and a critical element of asset management [2]. To keep a system ready for operation over a specified time frame, maintenance actions are required. Traditionally, maintenance actions are classified into corrective maintenance (CM) and preventive maintenance (PM) [3]. In CM, a failed system is replaced by a new one, while PM includes specific actions intended to avoid system failure or reduce its risk. Recently, another maintenance strategy, condition-based maintenance (CBM), has received increasing attention thanks to the development of sensor technology. In CBM, the real-time condition of a system is monitored to determine what maintenance needs to be performed [3].
CBM involves making maintenance decisions based on the actual deterioration condition of the components [1]. It consists of a chain of states representing various stages of deterioration and a set of maintenance actions [1]. Therefore, CBM is a sequential decision-making problem. Such sequential decision-making problems, often modelled as Markov decision processes (MDPs), can be solved by reinforcement learning (RL) algorithms, which have recently attracted considerable attention [4]. Thus, as an optimization tool for dynamic, uncertain environments, RL can provide an optimal decision strategy (policy) for the CBM problem [5]. For this purpose, the maintenance problem is first converted into an RL framework; then, RL algorithms are applied to obtain an optimal policy [7]. This work reviews the application of RL, a subfield of Machine Learning (ML), to maintenance modeling. In the following, we first briefly review RL and its algorithms.
RL is a subfield of ML, within Artificial Intelligence (AI), that deals with learning from repeated interactions with an environment [6]. The learner (decision maker) is called an agent; it interacts with the environment by performing specific actions and receiving feedback from the environment [7]. The feedback is usually termed a reward. The agent's goal (objective) is to maximize cumulative rewards by learning to perform better [7]. The environment is usually described by an MDP, consisting of a state space, an action space, a reward function, and state transition probabilities. Therefore, the MDP for an RL problem has the following components [11,17,9].

Z. Dehghani Ghobadi, F. Haghighi, A. Safari

$S$ is a set of states, and at each time step $t$, the state is $s_t \in S$. $A(s_t)$ is the set of possible actions, and the action at time $t$ and in state $s_t$ is $a_t \in A(s_t)$. $p(s' \mid s, a)$ is the transition probability of being in state $s'$ at time $t + 1$ if the system was in state $s$ at time $t$ and the agent chose action $a$. $r_t$ is the reward at time $t$. $\gamma \in (0, 1)$ is a discount factor, which essentially determines how much the RL agent cares about rewards in the distant future relative to those in the immediate future. Figure 1 shows the agent-environment interactions in an MDP (for more details, see, e.g., [8]).

The maintenance of a system is usually planned based on the system failure mechanism. Generally, a system can fail due to degradation, shock, or both. If the system fails only because of degradation, then a degradation model (a stochastic process or a deterministic path) is used to model the failure mechanism. In this case, CBM is adjusted based on the information received about the system degradation. If the system fails due to shocks, a shock model is used to model the failure mechanism, depending on the types of shocks. In this setup, CBM is designed based on information about the shocks, including their number, their magnitude, and how they affect the system failure. In the more complex case where the system failure is modeled jointly by degradation and arriving shocks, both degradation and shock information are used in the design of the CBM. We will focus on only the first and third cases here.
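The third failure mechanism above (degradation combined with shocks) can be illustrated with a small simulation. This is only a sketch under stated assumptions: it uses a gamma process for degradation and a homogeneous Poisson process for shock arrivals, which are the modeling choices reported for [9] later in this article, while the function name `simulate_failure_time` and every numeric parameter (threshold, shape, scale, shock rate, shock damage, fatal-shock probability) are hypothetical values chosen for illustration.

```python
import random


def simulate_failure_time(rng, failure_threshold=10.0, shape_per_step=0.5,
                          scale=1.0, shock_rate=0.2, shock_damage=1.0,
                          fatal_shock_prob=0.05, max_steps=10_000):
    """Simulate one component until failure.

    Returns (failure_step, cause, path), where path[k] is the degradation
    level after k unit time steps. All parameter values are illustrative.
    """
    level = 0.0
    path = [level]
    for step in range(1, max_steps + 1):
        # Gamma-process degradation: independent, non-negative increments.
        level += rng.gammavariate(shape_per_step, scale)
        # Homogeneous Poisson shocks: walk through the arrivals that fall
        # inside this unit time step using exponential inter-arrival times.
        elapsed = rng.expovariate(shock_rate)
        while elapsed < 1.0:
            if rng.random() < fatal_shock_prob:
                # A sufficiently severe shock fails the component at once.
                path.append(level)
                return step, "shock", path
            level += shock_damage  # a non-fatal shock accelerates degradation
            elapsed += rng.expovariate(shock_rate)
        path.append(level)
        if level >= failure_threshold:
            return step, "degradation", path
    return max_steps, "censored", path


rng = random.Random(42)  # fixed seed so the run is reproducible
failure_step, cause, path = simulate_failure_time(rng)
print(f"failed at step {failure_step} due to {cause}")
```

Setting `shock_damage` and `fatal_shock_prob` to zero reduces the sketch to the degradation-only case, the first of the three mechanisms.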
The paper is organized as follows. Section "Literature review" introduces previous related research on maintenance policies for complex systems with reinforcement learning. The procedure for the CBM approach and the taxonomy of reinforcement learning algorithms are introduced in the section "Research fundamentals." The third section explains two CBM optimal-policy problems solved by RL in which the problem is formulated as a Markov decision process, and the fourth section describes how a CBM problem is formulated as a semi-Markov decision process so that an RL approach can be applied. The fifth section explains the CBM problem modeled as a continuous-state MDP without discretizing the system degradation state, the sixth section illustrates how to find the optimal CBM policy with deep reinforcement learning (DRL), and the section "Conclusion" presents the conclusion and future work.

Literature review
Only a few studies have investigated RL for finding an optimal condition-based maintenance schedule that minimizes cost. Adsule et al. [1] modeled the CBM decision-making problem as a continuous semi-Markov decision process (CSMDP) and applied an RL algorithm. Yousefi et al. [9] modeled the CBM decision-making problem as an MDP and also used an RL algorithm. Peng et al. [10] modeled the CBM problem as a continuous Markov decision process, without discretizing the degradation states, under a Gaussian process (GP) and then applied an RL algorithm. Mahmoodzadeh et al. [11] proposed an optimal CBM policy for gas pipelines using an RL algorithm. Yousefi et al. [6] presented a DRL method to provide a new dynamic maintenance model for a repairable system subject to degradation and random shocks. Zhang et al. [12] proposed a novel and flexible CBM model based on a custom DRL for multicomponent systems with dependent competing risks. Table 1 provides a summary of the studies mentioned.

Procedure for CBM approach
CBM can be carried out by (1) gathering and monitoring product status data; (2) making a real-time diagnosis of the product status; (3) estimating the deterioration level of the product and its repair cost, which depends on the deterioration level, or its replacement cost, and so on; (4) predicting the time at which the product becomes abnormal; and (5) executing appropriate actions such as repair, replacement, continued use as is, or disposal. Figure 2 shows the generic procedure for implementing CBM.

Taxonomy of Reinforcement learning algorithms
RL algorithms can be classified from different perspectives. Here, we classify them based on whether the environment model is assumed to be known. A taxonomy of RL algorithms based on this classification is given in Figure 3.

Figure 3. Taxonomy of reinforcement learning algorithms [8]

Note that a "model" means an ensemble of acquired environmental knowledge. Depending on whether the environment model is used, RL algorithms are classified into model-based and model-free classes [7]. In model-based RL, all elements of the environment MDP are known, and the RL algorithms use them in learning the optimal policy [7]. The model-based methods can be split into two categories: given the model and learning the model [7]. In the given-model methods, the reward function and the transition process can be accessed directly by the agent (e.g., Gaussian Process for reinforcement learning [GPRL]) [10].
In contrast, in the learning-the-model methods, the agent first learns the model from interactions with the environment and then applies the learned model to find the optimal policy [7]. Model-based approaches can become impractical in many realistic applications (Huang [18]). Alternatively, the optimal policy can be obtained directly without knowing the environment model. This class is called model-free RL. The model-free methods fall into two main categories: value-based and policy-based. The value-based methods usually first learn the action-value function Q(s, a), the cumulative discounted reward obtained by starting from state s and taking action a, and then select the action with the highest cumulative discounted reward according to the learned Q(s, a) [8]. The other approach, called the policy-based method, optimizes the policy directly (without learning Q(s, a)). The value-based methods are divided into on-policy and off-policy methods. The on-policy methods, such as SARSA [7], learn or improve the policy that the agent is acting upon in its interactions with the environment, whereas the off-policy methods, such as Q-learning [10,12], Deep Q-learning [6,13], and SMART [1], can learn or improve a policy that is different from the one the agent uses to take actions in the environment. With off-policy methods, the experience of other agents interacting with the environment can also be used to find the optimal policy. The policy-based methods are classified into two categories: gradient-based and non-gradient-based. The gradient-based methods can be used to improve parameterized policies, and the non-gradient-based methods are applied to optimize less complicated policies. More details about the classification of RL algorithms can be found in [6].
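The distinction between on-policy and off-policy value-based methods can be made concrete by comparing the standard tabular update rules of SARSA and Q-learning. The sketch below is illustrative only: the state and action labels, the learning rate `alpha`, the discount factor `gamma`, and the initial Q-values are hypothetical and are not taken from any of the reviewed studies.

```python
from collections import defaultdict


def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Off-policy update: bootstrap from the greedy action in s_next,
    regardless of which action the agent will actually take there."""
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])


def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy update: bootstrap from the action a_next that the
    agent's current policy actually selected in s_next."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])


def greedy_action(Q, s, actions):
    """Action with the highest learned Q-value in state s."""
    return max(actions, key=lambda b: Q[(s, b)])


# Hypothetical maintenance example with two degradation states.
actions = ["do nothing", "repair", "replace"]
Q = defaultdict(float)  # unseen (state, action) pairs default to 0.0
Q[("worn", "replace")] = 10.0
Q[("worn", "do nothing")] = 2.0

# One observed transition: in state "good" the agent did nothing,
# paid a cost of 1 (reward -1), and the component became "worn".
q_learning_update(Q, "good", "do nothing", -1.0, "worn", actions)
print(Q[("good", "do nothing")])
```

With the same transition, `sarsa_update` called with `a_next="do nothing"` would bootstrap from the smaller value stored for `("worn", "do nothing")`, so the two methods generally learn different estimates whenever the behavior policy is not greedy.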

Finding the optimal policy with the RL approach to solve the CBM problem as an MDP
Yousefi et al. [9] considered an RL approach to develop a new dynamic CBM policy for multi-component systems with individually repairable components. They made the following assumptions about the failure model:
1. Each component is subject to two competing failure processes: degradation and random shocks.
2. A gamma process is used to model the degradation path of each component.
3. Shock arrivals follow a homogeneous Poisson process.
4. Each incoming shock may cause the system to fail immediately due to its magnitude, and it also affects the degradation path of the components.
To apply the RL approach, they converted the optimal maintenance problem, stated in terms of the component degradation level over time, into an MDP based on the following assumptions:
1. Three degradation thresholds, indexed i = 1, 2, 3, are fixed in advance and known.
2. The action space consists of three actions: "do nothing", "repair a component", and "replace a component".
Using the Q-learning method, they obtained the optimal maintenance actions for all the system degradation states. As an advantage, this method provides a dynamic maintenance policy for each specific degradation state of the system, which is more beneficial than a fixed maintenance plan.

In another study, Mahmoodzadeh et al. [11] proposed a CBM policy for gas pipelines via the RL method. Gas pipeline systems are among the largest energy infrastructures in the world and are known to be very efficient and reliable. However, this does not mean they are free of risk. Corrosion is a significant problem in gas pipelines that imposes large risks, such as ruptures and leakage, on the environment and the pipeline system. Therefore, various maintenance actions are performed routinely to ensure the integrity of the pipelines. The costs of corrosion-related maintenance actions are a significant portion of the pipeline's operation and maintenance costs.
Minimizing this high cost is a highly compelling subject that many studies have addressed. Mahmoodzadeh et al. [11] investigated the benefits of applying RL techniques to the corrosion-related maintenance management of dry gas pipelines. As the first step, they converted the pipeline's corrosion maintenance problem into a sequential decision-making problem by defining it in an MDP format. Because the scope of the research is the corrosion of the pipeline, the state definition should include all the information essential to predict the next corrosion status given the action. Therefore, they initially designed the state definition to include the depth and length of the corrosion. However, instead of directly taking the values of the depth and length, they used max-normalized versions of them, which removes the agent's dependency on the pipeline's parameters. Equations (1) and (2) define the normalized corrosion depth and length, where the maximum corrosion depth is the wall thickness and the maximum corrosion length has been estimated by running the model for 40 years without maintenance.
Representing the corrosion state with only the depth and length is inadequate because the next stage of the corrosion is not predictable without knowing the rate of corrosion degradation. Therefore, the corrosion rate has been added to the state variables. They assumed that the agent can access the state variables only through monthly inspections of the corrosion depth and length. Therefore, the corrosion rate is derived by comparing the current month's corrosion with the previous month's corrosion. Since corrosion is a slow and gradual process, the agent does not need high precision in the state representation. The corrosion rate presence (CRP) is therefore represented as a binary variable with a value of 0 when there is no corrosion aggravation and 1 when the corrosion exacerbates. The following equation formulates the corrosion rate presence as the third state variable.
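The three state variables described above can be assembled as in the following sketch. It is not the formulation of [11], whose equations are not reproduced in this text: the names `CorrosionState` and `encode_state`, the numeric inputs, and in particular the rule that corrosion "exacerbates" when either the depth or the length has grown since the previous monthly inspection are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CorrosionState:
    depth_norm: float   # corrosion depth / wall thickness, in [0, 1]
    length_norm: float  # corrosion length / assumed maximum length, in [0, 1]
    crp: int            # corrosion rate presence: 0 = no aggravation, 1 = aggravation


def encode_state(depth, length, prev_depth, prev_length,
                 wall_thickness, max_length):
    """Build the state from two consecutive monthly inspections."""
    # Max-normalization removes the dependency on pipeline dimensions.
    depth_norm = min(depth / wall_thickness, 1.0)
    length_norm = min(length / max_length, 1.0)
    # Assumption: aggravation means depth or length grew since last month.
    crp = 1 if (depth > prev_depth or length > prev_length) else 0
    return CorrosionState(depth_norm, length_norm, crp)


# Hypothetical inspection values (units are arbitrary but consistent).
state = encode_state(depth=2.0, length=30.0, prev_depth=1.5, prev_length=30.0,
                     wall_thickness=10.0, max_length=120.0)
print(state)
```

A tabular method such as Q-learning would additionally map `depth_norm` and `length_norm` to discrete bins before using the state as a table key; the bin boundaries are not given in this text, so that step is omitted here.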
Thus, they discretized the state variables into 24 bins, as shown in Table 2. A discrete action space of size 5 is considered for the agent: {Do nothing, Batch corrosion inhibitor, Internal coating, Cleaning pigging, Replacement}. The details of the considered maintenance actions are shown in Table 3. The total reward after each month is defined as the algebraic sum of the cost of failure, the life extension reward, and the cost of maintenance [11]. The approach used in this research is entirely data-driven and model-free. The agent treats the model as a black box that mimics a real pipeline and emits the data required for the learning process. The Q-learning algorithm was applied to the problem of optimal corrosion maintenance management of the pipeline. The results show that the proposed condition-based maintenance management technique can reduce maintenance costs by up to 58% compared to a periodic maintenance policy while securing pipeline reliability.

Finding the optimal policy by RL approach to solve the CBM problem as a continuous semi-Markov decision process
In the semi-Markov formulation of Adsule et al. [1], the following maintenance actions are considered:
1. No action (N): no maintenance action is required, and the component is allowed to work in its current state.
2. Minor maintenance (MM): a failed system is restored just back to a functioning state. After minor maintenance, the system continues as if nothing had happened, and the likelihood of system failure is the same immediately before and after the failure. Such a minimal repair thus restores the system to an "as bad as old" condition.
3. Replacement through PM.
4. Replacement through CM.
In this case, the maintenance action "minor maintenance" (MM) refers to the re-lubrication of the component surface, which will reduce its wear rate. A PM action results in the planned replacement of the components, which means the machine is stopped with proper scheduling.

In the MDP framework for CBM [1], the system under maintenance occupies a state $s \in S$ at each decision epoch, where $S$ is the set of all possible conditions. Based on the system state, an action $a \in A$ is taken on the system, and an expected reward $r(s, a)$ is received. The model can be fully described by the decision epochs, the states, the actions, the transition probabilities $p(s' \mid s, a)$, and the rewards $r(s, a)$. The reward function is evaluated from the costs related to maintenance, and the state transition distribution is assumed to be learned from existing samples. Policies that assign a single action to each state are considered.

Tabular solving methods have been used for the maintenance problems mentioned in the previous sections. Representing the system state directly by the sensor data, however, commonly leads to a large state space and a heavy computational burden. To handle a large or continuous state space that cannot be addressed by the tabular method, one can turn to function approximation to model the state transitions of the system and the value functions (both state-value functions and state-action value functions). A general approximator is preferred when there is not enough information on the possible function to approximate value functions. Although neural networks can model various relationships, they usually require a large amount of data. Gaussian process regression (GPR) can fit small datasets without loss of generality.
As an application, Peng et al. [10] demonstrated their proposed method by modeling the battery maintenance decision-making problem as an MDP, where GPR describes the system dynamics and value functions. Using NASA battery randomized usage data, the Gaussian Process for reinforcement learning (GPRL) algorithm was applied over the state value iteration. Compared with discrete MDPs, the GPRL algorithm returned a similar optimal policy while being computationally more efficient. They showed that GPRL could save up to 11.9% of the average cost compared to the MDP results (the saving varies with the parameter setting). It is worth mentioning that GPs have been widely adopted for modeling stochastic processes in reliability and maintenance studies. Also, as a general nonparametric model, GPR has gained a reputation for its universality, its good utilization of data, and its ease of implementation [15].
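To give a feel for how GPR can stand in for a value table over a continuous degradation state, the sketch below fits a zero-mean Gaussian process with a squared-exponential (RBF) kernel to a handful of state-value samples and predicts the value at an unvisited state. It is not the GPRL algorithm of [10]: the kernel choice, the length scale, the noise level, and the sample values are all assumptions made for illustration, and only the posterior mean is computed.

```python
import math


def rbf(x1, x2, length_scale=1.0):
    """Squared-exponential (RBF) kernel between two scalar states."""
    return math.exp(-0.5 * ((x1 - x2) / length_scale) ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def gp_fit(states, values, noise=1e-6):
    """Return alpha = (K + noise * I)^-1 y, used by the posterior mean."""
    K = [[rbf(a, b) + (noise if i == j else 0.0)
          for j, b in enumerate(states)]
         for i, a in enumerate(states)]
    return solve(K, values)


def gp_predict(x, states, alpha):
    """Posterior mean of the value function at a new state x."""
    return sum(rbf(x, s) * a for s, a in zip(states, alpha))


# Hypothetical samples: degradation level -> estimated value (negative cost).
states = [0.0, 1.0, 2.0, 3.0, 4.0]
values = [0.0, -1.0, -3.0, -6.0, -10.0]
alpha = gp_fit(states, values)
print(gp_predict(2.5, states, alpha))  # value estimate at an unvisited state
```

Because the approximator is queried at arbitrary real-valued states, no discretization of the degradation level is needed, which is the property the continuous-state formulation relies on.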

Finding the optimal CBM policy with Deep reinforcement learning (DRL)
Most existing research on CBM assumes that preventive maintenance should be conducted when the degradation of system components reaches specific threshold levels upon inspection. However, searching for optimal maintenance threshold levels is often efficient only for low-dimensional CBM problems; it becomes challenging as the number of components grows, especially when those components are subject to complex dependencies. Another limitation of most existing CBM models is that they often ignore competing failure risks when incorporating various types of dependencies, which are common in many real-world systems [16,17]. In this context, competing risk refers to a system failure due to the failure of any of its components. For instance, a modern computer could fail due to the failure of its CPU, storage unit, or operating system, whichever occurs first. The competing risks also impose an economic dependency among components, since the system's downtime after one component fails is shared by all the components. Such economic dependency should be considered, which makes CBM even more challenging. Therefore, establishing a general CBM model that jointly incorporates component-wise dependencies and competing risks is necessary. Otherwise, the CBM planning could be inefficient and suboptimal, incurring higher operational and maintenance costs. Most applications of traditional RL have been limited to domains where the features can be handcrafted or represented in low-dimensional state spaces. Therefore, directly applying traditional RL to the maintenance planning of K-component systems with complex component-wise interactions would be computationally inefficient and challenging. To overcome this challenge, Zhang et al. [12] proposed a novel and flexible CBM model based on a custom DRL for multi-component systems with dependent competing risks.
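The notions of competing risks and economic dependency described above can be written down in a few lines. The sketch is illustrative and is not the model of [12]: the function names, the component failure times, and the cost figures are hypothetical, and the cost function assumes the simplest form of economic dependency, in which one shared downtime cost is paid per maintenance stop regardless of how many components are serviced.

```python
def system_failure(component_failure_times):
    """Competing risks: the system fails when its first component fails.

    Takes a dict {component: failure_time} and returns (time, component).
    """
    first = min(component_failure_times, key=component_failure_times.get)
    return component_failure_times[first], first


def maintenance_cost(components, unit_costs, downtime_cost):
    """Cost of one maintenance stop covering the given components.

    Assumed economic dependency: the downtime cost is incurred once
    per stop and shared by all components maintained during it.
    """
    if not components:
        return 0.0
    return downtime_cost + sum(unit_costs[c] for c in components)


# Hypothetical failure times (in months) for the computer example.
times = {"CPU": 5.2, "storage": 3.1, "operating system": 7.4}
print(system_failure(times))

# Hypothetical costs: grouping two components into one stop pays the
# downtime cost once instead of twice.
unit_costs = {"CPU": 4.0, "storage": 2.0, "operating system": 1.0}
joint = maintenance_cost(["CPU", "storage"], unit_costs, downtime_cost=10.0)
separate = (maintenance_cost(["CPU"], unit_costs, downtime_cost=10.0)
            + maintenance_cost(["storage"], unit_costs, downtime_cost=10.0))
print(joint, separate)
```

The gap between the joint and separate costs is what couples the maintenance decisions of otherwise independent components, and it is one reason why per-component thresholds become hard to optimize as the number of components grows.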
DRL is an approach in machine learning that blends reinforcement learning techniques with deep learning strategies. This type of learning requires computers to use sophisticated learning models and process large amounts of input in order to determine an optimized path or action. The CBM model proposed by Zhang et al. [12] for a K-component system differs from the existing models in two ways: 1. It jointly incorporates stochastic dependency, economic dependency, and competing failure risks among components. 2. It completely excludes the concept of maintenance thresholds, which are key decision variables in conventional CBM policies. Specifically, the proposed model directly maps the multi-component degradation measurements at each inspection epoch to the maintenance decision space with a cost minimization objective. The use of DRL enables high computational efficiency and thus makes the proposed model suitable for both low- and high-dimensional CBM problems.
They showed that the system deterioration and maintenance process can be formulated as an MDP, and a Deep Q-learning (DQL) algorithm was selected for making the maintenance decisions. DQL is a value-based algorithm that combines Q-learning and deep learning to approximate the Q-value function. In other words, DQL is an alternative to Q-learning for solving RL problems with huge state and action spaces, or when the state or action spaces are continuous. Specifically, the DQL algorithm aims to recognize patterns instead of mapping every state to its best action. The difference between Q-learning and DQL is illustrated in Figure 5.

Yousefi et al. [6] proposed a DRL method to provide a dynamic maintenance model for a repairable system subject to degradation and random shocks, in which the maintenance scheduling problem is formulated as an MDP.

The existing work in the literature shows that RL and DRL methods can provide a policy for CBM, with more recent studies turning to DRL methods such as DQL [12]; see also Chapter 8 of [18].
