Real-time health prognosis and dynamic preventive maintenance policy for equipment under aging Markovian deterioration


Vol. 45, No. 15, 1 August 2007, 3351–3379


ARGON CHEN* and G. S. WU

Graduate Institute of Industrial Engineering, National Taiwan University, 1, Roosevelt Rd. Sec. 4, Taipei 106, Taiwan

(Revision received February 2006)

An often seen practice of preventive maintenance (PM) is to construct a machine’s reliability model based on its historical failure records. The reliability model is then used to determine the PM schedule by minimizing the machine’s long-run operation cost or average machine downtime. Machines in many hi-tech manufacturing sectors now use sophisticated sensor technologies that provide sufficient immediate online data for real-time observation of equipment condition. Not only the historical data but also the real-time condition is now available for scheduling a more effective PM policy. This research aims to determine an effective PM policy based on real-time observations of equipment condition. We first use the multivariate process capability index to integrate the equipment’s multiple parameters into an overall equipment health index. This health index serves as the basis for real-time health prognosis under an aging Markovian deterioration model. A dynamic PM schedule is then determined based on the health prognosis.

Keywords: Real-time equipment monitoring; Equipment health prognosis; Markov chain modeling; Condition-based maintenance

1. Introduction

With the improvement of manufacturing technology and the emergence of high-tech industries, there are more and more factory productivity challenges. A major challenge is the drastically increased investment and operation cost. Take the semiconductor fabrication industry for example. The cost of a typical 25,000 wafers-per-month 300 mm fab is expected to exceed US$2.0–3.0 billion. The huge factory investment triggers the urgent need to improve the operation effectiveness. Among various types of costs, equipment cost usually takes up the largest portion of capital investment. It is estimated that 75% of total investment will be attributed to equipment investment in a typical 300 mm fab. The utilization and effectiveness of the equipment have become extremely important for industries with intensive capital investment in equipment.

A number of Japanese industries are now implementing the concept of Total Productive Maintenance (TPM) to drive the improvement of manufacturing efficiency (Takahasi and Osada 1990). The TPM paradigm has been shown to greatly improve the maintenance procedure, and to reduce or eliminate setups, test procedures and idle time. Similar objectives can be found in recent concerns on Overall Equipment Effectiveness (OEE) of the plant (Leachman 1997). To sum up, increasing the utilization and reducing the operation cost of equipment have become the critical factors of enhancing the company competitiveness.

*Corresponding author. Email: [email protected]

International Journal of Production Research

ISSN 0020–7543 print/ISSN 1366–588X online © 2007 Taylor & Francis

http://www.tandf.co.uk/journals DOI: 10.1080/00207540600677617

To improve the effectiveness and minimize the equipment downtime, an appropriate maintenance policy is critical. On the one hand, a frequent preventive maintenance (PM) schedule reduces the system’s unexpected breakdowns but also increases the maintenance cost. On the other hand, a less frequent PM schedule has less maintenance cost but increases the cost of unplanned breakdowns. How to determine an adequate policy of preventive maintenance is therefore an important issue of shop floor control.

A usual practice of PM policy in many industries is to replace parts or perform maintenance when the equipment’s running time reaches a pre-determined time length (Barlow and Hunter 1960, Blanks and Tordan 1986, Elsayed 1996). The advantage of this strategy is that it is easy to follow. Nevertheless, it depends only on the equipment’s reliability model derived from historical performance instead of from the real-time condition of the equipment. Recent developments in sensors, chemical and physical nondestructive testing, and sophisticated measurement technologies have made possible real-time observations of the system performance via many kinds of on-line data (e.g. temperature, pressure, voltage, current, vibration, corrosion, fluid, etc.). Makis and Jardine (1991, 1992) proposed a condition-based maintenance (CBM) to incorporate the machine condition into the age-based reliability model. Using Cox’s proportional hazards model (PHM) (e.g. Cox 1975, Cox and Oakes 1984, Kumar and Klefsjo 1994), CBM is based on a failure rate which is a function of both the tool age and the tool conditions. The same research team led by Professor Jardine has gone on to develop decision-making software and apply CBM to different types of components or machines in various industries (see Jardine et al. 1997, 2001, Banjevic et al. 2001, Lin et al. 2004). The state values of the tool condition considered in CBM, however, are limited to those stochastically increasing over time and those having a nondecreasing effect on the hazard rate (section 3, assumptions 2 and 4 in Makis and Jardine 1991).

There also exists a huge body of literature on optimal maintenance policies for systems under Markovian or semi-Markovian deterioration. Good reviews in this topic can be found in Barlow and Proschan (1965), McCall (1965), Sherif and Smith (1981), Valdez-Flores and Feldman (1989) and Dekker (1996). Under the deterioration models, some papers are concerned with optimal PM policies (e.g. Kao 1973); some focus on optimal inspection schedules (e.g. Milioni and Pliska 1988) and others consider simultaneous optimization of both inspection and PM schedules (e.g. Yeh 1996). Nevertheless, the optimal inspection schedule would not be a concern when modern sensor technologies are widely used especially in high-tech manufacturing industries.

To characterize a machine’s health, various types of equipment data should be accounted for. With the modern sensors built in the advanced processing equipment, there are usually tens or even hundreds of data items collected. It is necessary to integrate these data items and develop a single integrated index. Not only does the index provide an easy reading for engineers to have a quick idea on the equipment’s overall performance, it also serves as a basis for determining an optimal PM policy. Gertsbakh (1977) has proposed using Discriminant Analysis to find a linear combination function of machine parameters that best distinguishes between a ‘good’ machine and a ‘failed’ one. Similar to Discriminant Analysis, a linear combination function of parameters with the maximum contribution to the machine condition can be found through Principal Component Analysis or Singular Value Decomposition (e.g. Stamatis et al. 1992). In this paper, it is not our objective to compare and find the best health index. Rather, we attempt to follow the modern equipment monitoring approach (Chen et al. 1998) to evaluate the machine condition by observing the distribution of the machine parameters’ readings and comparing it against pre-defined machine specifications. A health evaluating method is thus proposed based on the ideas of multivariate process capability indices.

With a single health index available, we then attempt to propose a tool health prognosis method under an aging Markovian deterioration model of the equipment health. In using only the Markov chain to characterize a deteriorating machine, it is assumed that the transition probabilities are only state-dependent. That is, the probability to make transition to a less healthy state does not increase with the age. In the above-mentioned literature, the semi-Markovian model is often used to capture the aging effect by incorporating a random sojourn time in each state (see Kao 1973). Different from the semi-Markovian model, we introduce an aging factor that discounts the probabilities of transitions to healthier states while increasing the probabilities of transitions to less healthy states. We refer to this Markovian deterioration with aging effect as aging Markovian deterioration. With the equipment health prognosis, we can predict the behaviour of the equipment condition. The PM schedule with respect to either downtime or cost minimization can be then determined dynamically based on the predicted equipment health. We refer to it as dynamic PM policy.

As shown in figure 1, the upper dotted-line rectangle includes steps to build the offline models. Historical data are collected first. Some multivariate statistical methods, e.g. principal component analysis and time series analysis (Chen et al. 1998), are applied to extract important data items in the data conversion/preprocessing step. The health index is then computed using the pre-processed equipment data. After computing the health index, we enter the second phase where an aging Markovian deterioration model is identified and estimated based on historical observations of equipment health data. By doing so, the long-run performance of the equipment is established and possibly predicted. Finally, an effective PM policy is developed accordingly. The lower dotted-line rectangle shows the online use of this proposed PM policy. The details will be stated in the rest of this paper.

2. Evaluation of equipment condition and equipment health index

The advanced equipment and sensor technologies have provided more and more immediate data to reveal a machine’s condition. However, such data have not been effectively analysed and put to use in practice due to their enormous volume and the lack of efficient analysis methods. If we can establish a mechanism to analyse these real-time data, the equipment condition can be then observed and evaluated in a more timely fashion. A ‘Health Index’ for the equipment is therefore proposed here.


2.1 Evaluation of equipment condition with multivariate data

Usually, equipment provides several types of data. To integrate multiple correlated data items into a single index, multivariate statistical methods are used in this research. The health index must take into account the information contained in a set of parameters, which are online observed from the machine. For example, temperature, pressure, voltage, current, vibration, corrosion, fluid, etc., are typical parameters collected from a machine. They are often cross-correlated. Hence, not only the variation of individual variables but also the co-variation among variables should be taken into consideration.

Figure 2 shows a typical real-time equipment monitoring scheme referred to as ‘Bull’s Eye’ scheme. Values of various machine data items are displayed simultaneously on a monitoring board. The board consists of 3 concentric circular regions with different colours: green for SAFE, yellow for WARNING, and red for DANGEROUS. The distance between an observed data point and the board’s centre represents the data item’s deviation from its preset target. The distribution of data points provides an easy reading of the equipment’s current operating condition.

[Figure 1. Block labels of the flow diagram. Off-line model: historical data of equipment condition; data conversion; data pre-processing; Phase I: computing health index; sample data of health index; Phase II: building prognosis model; Phase III: dynamic PM model. On-line application: real-time equipment data; dynamic PM alarm; PM decision.]

When the points are concentrated around the centre, it indicates a good overall health. In contrast, when data points are scattered over a wide area, it indicates a worrisome situation. Thus, the engineers can easily read the equipment status by examining the distribution of the data points. Mathematical treatment to create such a ‘Bull’s Eye’ monitoring scheme will not be reiterated in this paper and can be found in O’Sullivan et al. (1996) and Chen et al. (1998).

Using the idea of ‘Bull’s Eye’ and different from Gertsbakh’s (1977) approach, an overall equipment health index can be calculated by the distribution of the data points. Using this idea, a quality measure known as multivariate process capability indices (PCIs) can be utilized. Different from our objectives here, the multivariate PCIs are created to measure a process’s performance by examining resulting products’ quality characteristics. But the basic theory is also feasible for evaluating equipment capability. Let’s recall the basic idea of multivariate process capability indices as in equation (1) (Taam et al. 1991, Kotz and Johnson 1993).

$$
C_p = \frac{\text{volume of specification region}}{\text{volume of region containing 99.73\% of observations}} \tag{1}
$$

The basic assumption of the above multivariate PCI is that the observations of machine parameters must follow a multivariate normal distribution. This assumption may not be appropriate for all situations. Fortunately, many studies on data modeling, which removes the common cause or data pattern of equipment parameters, have been proposed. Most of the machine parameters can therefore be filtered with only white noise left (Chen et al. 1998). In fact, as long as the distribution approximately concentrates on the central region, the idea of multivariate PCI can be adopted. Based on equation (1), figure 3 shows a possible relation between the specification region and the performance region for two hypothetical quality characteristics $X_1$ and $X_2$. The desired $X_1$ and $X_2$ are to have a target at $[0, 0]^T$, covariance equal to 0.8, and both variances equal to 1. The actual performance, of which 99.73% is shown in the shaded elliptical region, forms an elliptical region with covariance equal to 0.6 and covers a much smaller area than the specification region.

Assume that there are $\nu$ important machine parameters, $X_1, X_2, \ldots, X_\nu$.

When the machine is running normally, the machine parameters will follow a multivariate normal distribution with a mean vector equal to target settings $T$ and covariance matrix $A$. By replacing the product quality characteristics in the multivariate PCI with machine parameters, we define a Machine Capability Index (MCI) as follows:

$$
\mathrm{MCI} = \frac{\text{volume of } \{X : (X - T)^{T} A^{-1} (X - T) \le K^{2}\}}{\text{volume of } \{X : (X - \bar{X})^{T} {V_0'}^{-1} (X - \bar{X}) \le \chi^{2}_{\nu,\,0.9973}\}} \tag{2}
$$

where

$K$ is used to adjust the size of the specification region;

$X$ is the vector of sample observations of machine parameters;

$\bar{X}$ is the mean vector of sample observations $X$;

$V_0$ is the observed covariance matrix of $X$;

$V_0' = V_0 + (\bar{X} - T)(\bar{X} - T)^{T}$; and

$\chi^{2}_{\nu,\,0.9973}$ is the 99.73th percentile of the $\chi^{2}$ distribution with $\nu$ equal to the number of machine parameters.

In figure 3, the specification region is formed by:

$$
K = 6, \quad T = \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \quad A = \begin{bmatrix} 1 & 0.8 \\ 0.8 & 1 \end{bmatrix},
$$

while the performance region is formed by:

$$
\chi^{2}_{2,\,0.9973} = 11.83, \quad \bar{X} = T, \quad V_0' = V_0 = \begin{bmatrix} 1 & 0.6 \\ 0.6 & 1 \end{bmatrix}.
$$

With the above specifications and actual machine performance, the MCI in equation (2) can now be calculated as:

$$
\mathrm{MCI} = \frac{\text{volume of } \{(X - T)^{T} A^{-1} (X - T) \le 6^{2}\}}{\text{volume of } \{(X - \bar{X})^{T} {V_0'}^{-1} (X - \bar{X}) \le 11.83\}}
= \frac{6^{2} / \sqrt{\det(A^{-1})}}{11.83 / \sqrt{\det({V_0'}^{-1})}}
= \frac{6^{2} / \sqrt{2.778}}{11.83 / \sqrt{1.563}} = 2.28.
$$

Figure 4 shows a machine’s performance region with its centre, i.e. $\bar{X}$, shifted from $[0, 0]^T$ to $[3, 0]^T$.

In the case of figure 4, $V_0'$ is no longer the same as $V_0$:

$$
V_0' = V_0 + (\bar{X} - T)(\bar{X} - T)^{T}
= \begin{bmatrix} 1 & 0.6 \\ 0.6 & 1 \end{bmatrix} + \begin{bmatrix} 3 \\ 0 \end{bmatrix} \begin{bmatrix} 3 & 0 \end{bmatrix}
= \begin{bmatrix} 10 & 0.6 \\ 0.6 & 1 \end{bmatrix}.
$$

The calculated MCI, thus, becomes much smaller:

$$
\mathrm{MCI} = \frac{\text{volume of } \{(X - T)^{T} A^{-1} (X - T) \le 6^{2}\}}{\text{volume of } \{(X - \bar{X})^{T} {V_0'}^{-1} (X - \bar{X}) \le 11.83\}}
= \frac{6^{2} / \sqrt{\det(A^{-1})}}{11.83 / \sqrt{\det({V_0'}^{-1})}}
= \frac{6^{2} / \sqrt{2.778}}{11.83 / \sqrt{0.1037}} = 0.588.
$$
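The two MCI values above can be checked numerically. The following is a minimal sketch for the two-parameter case only; the function names `det2` and `mci_2d` are illustrative and not from the paper. It relies on the fact that the area of the ellipse {x : xᵀS⁻¹x ≤ c} is π·c·√det(S), so π cancels in the ratio, and on the closed form of the chi-square quantile for two degrees of freedom.

```python
import math

def det2(m):
    """Determinant of a 2x2 matrix given as nested lists."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def mci_2d(K, T, A, x_bar, V0, coverage=0.9973):
    """MCI of equation (2) for two machine parameters (illustrative helper)."""
    # Inflate the observed covariance by the offset of the mean from target.
    d = [x_bar[0] - T[0], x_bar[1] - T[1]]
    V0p = [[V0[r][c] + d[r] * d[c] for c in range(2)] for r in range(2)]
    # With 2 degrees of freedom the chi-square quantile is -2 ln(1 - p).
    chi2_q = -2.0 * math.log(1.0 - coverage)
    spec_area = K ** 2 * math.sqrt(det2(A))      # pi omitted, cancels below
    perf_area = chi2_q * math.sqrt(det2(V0p))    # pi omitted, cancels above
    return spec_area / perf_area

A = [[1.0, 0.8], [0.8, 1.0]]
V0 = [[1.0, 0.6], [0.6, 1.0]]
mci_fig3 = mci_2d(6, [0, 0], A, [0, 0], V0)   # mean on target (figure 3)
mci_fig4 = mci_2d(6, [0, 0], A, [3, 0], V0)   # mean shifted to [3, 0] (figure 4)
print(round(mci_fig3, 2), round(mci_fig4, 3))
```

For more than two parameters the same ratio would need a general determinant and chi-square quantile routine, which this sketch deliberately leaves out.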

It should be noted that the specification region in equation (2), defined by T, A and K, is assumed known in this paper. In reality, estimating such a specification region requires prior engineering knowledge and an understanding of equipment behaviour gained from continuous observations. Interested readers can refer to Taam et al. (1991) and Chen and Tsai (2004) for the details of establishing an elliptical engineering specification region.

2.2 A probable health index

One of our purposes for establishing the health index is to provide a numerical indicator of the equipment’s condition. That is, we would like to grade the machine’s performance based on real-time evaluation of the machine condition. First, let’s consider the properties of the multivariate PCI proposed earlier to evaluate the machine condition. For a process to be in a working condition, the performance region containing 99.73% of observations should fall inside the specification region. Namely, a process with the PCI value less than 1 would be deemed incapable. For the process capability to be acceptable, it is generally said that the PCI value should reach at least 1.33. Moreover, PCI = 2 implies that the volume of the specification region is twice as large as the volume of the performance region with the mean equal to the target. That is, the specification region contains almost all possible observations, with only a very slim chance that the equipment will perform out-of-spec. In the univariate case, when PCI = 2 there would be only a chance of 0.002 in one million that the equipment breaks down, i.e. falls out of the specification region (Kotz and Johnson 1993). Therefore, a process with PCI ≥ 2 is said to be in an excellent condition. From a barely working condition at PCI = 1 to an acceptable condition at PCI = 1.33 and a superb condition above 2, the PCI scale, though familiar to most engineers, ranges from 0 to ∞ and does not linearly reflect the equipment health.
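The univariate figures quoted above can be reproduced with a short calculation. This sketch assumes a normally distributed, on-target characteristic with symmetric specification limits at ±3·PCI standard deviations; the function name `out_of_spec_ppm` is illustrative.

```python
import math

def out_of_spec_ppm(cp):
    """Parts per million outside symmetric limits at +/- 3*cp sigma (centred normal)."""
    tail = math.erfc(3.0 * cp / math.sqrt(2.0))  # two-sided normal tail probability
    return tail * 1e6

for cp in (1.0, 1.33, 2.0):
    print(cp, out_of_spec_ppm(cp))
```

Under these assumptions PCI = 1 leaves about 0.27% of observations out of specification, while PCI = 2 leaves roughly 0.002 per million, which is why the scale is far from linear in terms of equipment health.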

To provide a more comprehensible view on the equipment condition, the MCI proposed in the previous section can be translated into a score in a linear scale with a closed range, such as [0, 100]. Here we propose a possible transformation function to transform the MCI value in $(0, \infty)$ to a score in $(0, U)$ with 0 and $U$ representing the worst and the perfect machine conditions, respectively:

$$
H = U \, \frac{1 - e^{-\mathrm{MCI}/d}}{1 + e^{-(\mathrm{MCI} - a)/d}} \tag{3}
$$

The shape of the mapping function in equation (3) can be determined by adjusting the values of $a$ and $d$ based on the engineers’ own engineering judgment. Generally speaking, $a$ affects the location and $d$ affects the width and slope.

Here, we suggest a setting of (a, d) = (1.04, 0.32). Figure 5 shows the curve of this mapping function with U = 100 representing a perfect machine condition. Table 1 lists the corresponding values of MCI and health index under this setting. A function with this setting translates MCI = 1.33 to about H = 70 and MCI = 2 to about H = 95.

It has to be noted that one may define one’s own transformation function to translate the MCI value to any closed interval as long as engineers’ view of machine condition can be faithfully reflected. Since PCI itself is already a familiar measure to most engineers, using MCI directly as a health index without any transformation is also plausible. Irrespective of the transformation methods, the health prognosis models and dynamic PM policies proposed in the coming sections can be still applied.

Figure 5. Mapping function with (a, d) = (1.04, 0.32) for a (0, 100) score range (horizontal axis: MCI; vertical axis: health index).

Table 1. Values of the mapping function in equation (3) with setting (a, d) = (1.04, 0.32).

MCI     Health index
0       0.00
0.5     12.34
1       44.82
1.2     60.78
1.33    70.11
1.5     80.06
2       95.07
3       99.77
4       99.99
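The values in table 1 can be regenerated with a few lines of code. The sketch below assumes the mapping has the form H = U(1 − e^(−MCI/d)) / (1 + e^(−(MCI − a)/d)), which reproduces every entry of table 1 to two decimals for (a, d) = (1.04, 0.32) and U = 100; the function name `health_index` is illustrative.

```python
import math

def health_index(mci, a=1.04, d=0.32, U=100.0):
    """Health score: logistic-type curve located by a, scaled so that H(0) = 0."""
    return U * (1.0 - math.exp(-mci / d)) / (1.0 + math.exp(-(mci - a) / d))

# Entries of table 1 (MCI -> health index) used as a cross-check.
table1 = {0: 0.00, 0.5: 12.34, 1: 44.82, 1.2: 60.78, 1.33: 70.11,
          1.5: 80.06, 2: 95.07, 3: 99.77, 4: 99.99}
for mci, expected in table1.items():
    print(mci, round(health_index(mci), 2), expected)
```

Changing `a` shifts the MCI value at which the score rises most steeply, and changing `d` stretches or compresses the transition, which matches the roles described for the two parameters.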


3. Equipment health prognosis model

In the previous section, we proposed a health index to evaluate the real-time equipment conditions. With the proposed health index, a machine with a lower score is more likely to fail than a machine with a higher score. For the purpose of determining an appropriate PM schedule, the tendency towards failure needs to be modeled. Here, we propose a failure prognosis model, which can help us to predict the machine conditions for the time to come.

3.1 Equipment health prognosis under Markovian deterioration

In practice, the equipment health based on real-time parameters variability is subject to many noises in the manufacturing environment. To make robust PM decisions based on the health index, the health index should be further classified into a few discrete states. For instance, the Bull’s eye scheme classifies the equipment health into only three states (green, yellow and red). For three equipment states, a machine with MCI above 2.0 (health score > 95), generally considered as ‘good’, can be classified as ‘Green’ while the machine with MCI value below 1.0 (health score < 45) is said to be in the ‘Red’ state (or failure state). The ‘Yellow’ state can be then defined for MCI values in between 1.0 and 2.0. It is often an engineering call to determine the number of equipment states. Although it is not the focus of this paper, the engineers should be aware that the number of states affects the sensitivity and robustness of PM decisions and the engineering judgment should be made to strike a balance.

Suppose now the health score has been classified into $n$ discrete states: $F, 1, 2, 3, \ldots, n-1$. Since the equipment health state can be evaluated at each sampling time point, it can be viewed as a stochastic process: $H = \{H_t : t \ge 0\}$. If $H_t = i$, the equipment is said to be in state $i$ at time $t$. We assume here that when the process is in state $i$, there is a fixed probability $P_{i,j}$ that the health index will be in state $j$ at the next time point. We characterize such a stochastic process using a Markov chain model. For a Markov chain, the conditional distribution of any future state $H_{t+1}$ given the earlier states $H_0, H_1, \ldots, H_{t-1}$ and the present state $H_t$ becomes (Ross 2000)

$$
P(H_{t+1} = j \mid H_t = i, H_{t-1} = i_{t-1}, \ldots, H_1 = i_1, H_0 = i_0) = P(H_{t+1} = j \mid H_t = i) = P_{i,j}
$$

We first establish a condition-based prognosis model based on Markov chain theories. Let $\Pi$ denote the matrix of one-step transition probabilities $P_{i,j}$:

$$
\Pi =
\begin{array}{c|ccccc}
 & F & 1 & 2 & \cdots & n-1 \\ \hline
F & P_{F,F} & P_{F,1} & P_{F,2} & \cdots & P_{F,n-1} \\
1 & P_{1,F} & P_{1,1} & P_{1,2} & \cdots & \vdots \\
2 & \vdots & \vdots & \vdots & \ddots & \vdots \\
\vdots & \vdots & \vdots & \vdots & \cdots & P_{n-2,n-1} \\
n-1 & P_{n-1,F} & P_{n-1,1} & P_{n-1,2} & \cdots & P_{n-1,n-1}
\end{array}
\tag{4}
$$

where $\sum_{j=F}^{n-1} P_{i,j} = 1$. $P_{i,j}$ is the probability that $H$ transits to state $j$ given its current state $i$. Once the equipment enters the failure state $F$, it will stay in the failure state until it is repaired. The failure state is also known as an absorbing state:

$$
P_{F,F} = 1, \quad P_{F,j} = 0, \quad \text{for } j = 1, 2, \ldots, n-1.
$$

The two-step transition probability can be obtained by taking the square of $\Pi$. For instance, the $(i, j)$th entry of $\Pi^2$ is the probability to be at state $j$ after two periods of time, given the current state at $i$. Recursively, we can calculate the probability for the condition of the equipment to be at a certain state after any number of time periods.
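The multi-step calculation can be illustrated with a small example. The sketch below uses a hypothetical three-state chain whose numbers are chosen only for illustration and do not come from the paper; `mat_mul` and `k_step` are illustrative helper names.

```python
def mat_mul(A, B):
    """Product of two square matrices stored as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def k_step(P, k):
    """k-step transition matrix P^k for k >= 1."""
    result = P
    for _ in range(k - 1):
        result = mat_mul(result, P)
    return result

# Hypothetical chain: index 0 = F (absorbing), 1 = worn, 2 = good.
P = [[1.0, 0.0, 0.0],
     [0.2, 0.7, 0.1],
     [0.05, 0.25, 0.7]]
P2 = k_step(P, 2)
# Because F is absorbing, this entry is the probability that a machine
# starting in the 'good' state has failed within two periods.
print(P2[2][0])
```

Since the failure state is absorbing, the first column of the k-step matrix gives the cumulative failure probability within k periods from each starting state.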

To further interpret the matrix $\Pi$, each row, say row $i$, of $\Pi$ represents a state probability distribution given the current state at $i$. This conditional probability distribution will likely form a bell shape. For example, in the beginning of operation, $H$ is most likely to be at a good-condition state, say state $i$, and remain at state $i$ at the next available time point. $H$ is less likely to move to other states. Thus, $P_{i,i}$ is likely to be the highest probability among the $P_{i,j}$’s (see figure 6).

For example, let the current equipment health state be 80. Then, at the next time point the equipment health has the greatest possibility to stay at state 80. When the health state does change, the probability to move to state 70 will be greater than the probability to move to state 60. Likewise, the probability to move to state 90 should be greater than that to move to state 100. This leads to a bell-shape probability distribution as illustrated in figure 6. The above Markov-chain model of the equipment health does not explicitly consider the equipment’s age although equipment aging may be exhibited through the stochastic evolution of the health state. However, with the constant $\Pi$, when a machine arrives at a state, the transition probability distribution to move to other states is always the same no matter how old the machine has grown. That is, a machine could stochastically arrive at the same state at different ages but the deterioration ‘tendency’ is exactly the same given the same health state. In the following section, we will propose an aging Markovian model in which, even starting with the same health state, the machine becomes more likely to enter into less healthy states as it grows older.

[Figure 6: value of $P_{i,j}$ plotted against health state $j$ (F, 1, 2, 3, 4, ..., n−1); the highest value is $P_{i,i}$.]

3.2 Equipment health prognosis under aging Markovian deterioration

An important property of the above Markovian model is the bell-shape probability distribution presented by entries in the same row of $\Pi$. But as the machine grows older the health tends to worsen due to deterioration of the equipment; i.e. the probability for the machine to become healthier will decrease while the probability of becoming less healthy will increase. Let a larger value of the state correspond to a higher health score. We can then observe that each row of matrix $\Pi$, i.e. the conditional probability distribution, should act like a moving wave as the equipment ages. Let’s observe the wave motion in figure 7. Stochastically, a machine could visit the same state $i$ at different ages; the machine at a higher age should have higher probabilities to become less healthy while the probabilities to become healthier should decrease. The moving-wave effect is the result of the machine’s deterioration over its run time. The longer the machine has been run, the worse the machine’s health ‘tendency’. We extend the Markovian model to include this aging effect and call it an aging Markovian deterioration model.

Suppose $\Pi_t$ is the transition probability matrix at age $t$ and $P_{i,j}(t)$ is the entry of $\Pi_t$ representing the transition probability from state $i$ to state $j$ at $t$. Also, let $P_{i,M}(t) = \max_j \{P_{i,j}(t),\ j = F, 1, 2, \ldots, n-1\}$, where $M = \arg\max_j \{P_{i,j}(t),\ j = F, 1, 2, \ldots, n-1\}$, represent the peak transition probability in the bell-shape distribution. Denote the left-hand side and right-hand side cumulated probabilities as $P^L_i(t) = \sum_{j=F}^{M-1} P_{i,j}(t)$ and $P^R_i(t) = \sum_{j=M}^{n-1} P_{i,j}(t)$, respectively. Then, $P^R_i(t) = 1 - P^L_i(t)$ since $\sum_{j=F}^{n-1} P_{i,j}(t) = 1$. As shown in figure 7, when the machine becomes older, $P^L_i$ increases while $P^R_i$ decreases. That is, $P^L_i(t) \le P^L_i(t + \Delta)$ and $P^R_i(t) \ge P^R_i(t + \Delta)$, where $\Delta$ is the time interval between the health observation at $t$ and the next available observation.

[Figure 7: value of $P_{i,j}$ plotted against health state $j$ (F, 1, 2, 3, 4, ..., M, ..., n−1); the bell-shaped curve moves as age increases.]

Assuming further that the increase (or decrease) in $P_{i,j}(t)$ is proportional to the fraction $P_{i,j}(t)$ takes in $P^L_i(t)$ (or in $P^R_i(t)$), we have

$$
P_{i,j}(t + \Delta) - P_{i,j}(t) = \frac{P_{i,j}(t)}{P^L_i(t)} \left[ P^L_i(t + \Delta) - P^L_i(t) \right] \quad \text{for } j < M \tag{5a}
$$

and

$$
P_{i,j}(t) - P_{i,j}(t + \Delta) = \frac{P_{i,j}(t)}{P^R_i(t)} \left[ P^R_i(t) - P^R_i(t + \Delta) \right] \quad \text{for } j \ge M. \tag{5b}
$$

We define an aging factor $\delta \in (0, 1)$ as

$$
\frac{P^L_i(t + \Delta) - P^L_i(t)}{P^L_i(t)} = \frac{P^R_i(t) - P^R_i(t + \Delta)}{P^L_i(t)} = \delta \quad \text{for } \forall i \text{ and } \forall t. \tag{6}
$$

$\delta$ represents the increase rate in $P^L_i(t)$ for a machine becoming older and remains constant for all states over the entire lifetime of an aging machine. We can now derive an age-dependent model to describe the wave-moving probability distribution from equations (5a), (5b) and (6). Given the current age $t$, the transition probability at the next available time $t + \Delta$ is

$$
P_{i,j}(t + \Delta) = P_{i,j}(t) + \frac{P_{i,j}(t)}{P^L_i(t)} \, \delta P^L_i(t) = P_{i,j}(t)\,(1 + \delta) \quad \text{for } j < M \tag{7a}
$$

and

$$
P_{i,j}(t + \Delta) = P_{i,j}(t) - \frac{P_{i,j}(t)}{P^R_i(t)} \, \delta P^L_i(t) = P_{i,j}(t) \left[ 1 - \delta \frac{P^L_i(t)}{P^R_i(t)} \right] \quad \text{for } j \ge M. \tag{7b}
$$

This function is derived from the principle that $P_{i,j}$ should decrease for $j \ge M$ and increase for $j < M$. $\delta$ can be viewed as an aging factor and also should be estimated from historical equipment data. The above equations are developed to describe the wave-moving phenomenon and ensure that $\sum_j P_{i,j}$ remains 1. Starting with the initial transition probability matrix at $t = 0$ ($\Pi_0$), the values of the entries in $\Pi_t$ evolve according to equations (7a) and (7b) as the machine age ($t$) grows.

The equipment health prognosis can thus be established under this aging Markovian deterioration.
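The age-dependent update can be sketched in code. The example below applies equations (7a) and (7b) row by row, writing the aging factor as `delta`; the 4-state matrix and the helper names `age_row` and `age_matrix` are hypothetical and serve only to show that each row keeps summing to 1 while probability mass drifts towards less healthy states.

```python
def age_row(row, delta):
    """Apply equations (7a) and (7b) once to one row of the transition matrix.

    row[0] is the failure state F; higher indices are healthier states.
    """
    M = max(range(len(row)), key=lambda j: row[j])  # position of the peak
    p_left = sum(row[:M])     # cumulated probability of states below the peak
    p_right = sum(row[M:])    # cumulated probability of the peak and above
    if p_left == 0.0:
        return list(row)      # nothing to inflate, e.g. the absorbing row F
    shrink = 1.0 - delta * p_left / p_right
    return [p * (1.0 + delta) if j < M else p * shrink
            for j, p in enumerate(row)]

def age_matrix(P, delta, steps=1):
    """Evolve the whole matrix over a number of observation intervals."""
    for _ in range(steps):
        P = [age_row(r, delta) for r in P]
    return P

# Hypothetical 4-state matrix (F, 1, 2, 3) and aging factor.
P0 = [[1.0, 0.0, 0.0, 0.0],
      [0.10, 0.60, 0.20, 0.10],
      [0.05, 0.15, 0.60, 0.20],
      [0.02, 0.08, 0.20, 0.70]]
P90 = age_matrix(P0, delta=0.005, steps=90)
print([round(sum(r), 6) for r in P90])
print(round(P90[1][0], 4))
```

As long as the peak of a row stays in the same column, the mass below the peak grows geometrically at rate (1 + delta) per interval, so even a small aging factor has a visible effect over many observations.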

3.3 Parameter estimation

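Both the aging factor $\delta$ and the initial transition probability matrix $\Pi_0$ need to be estimated. We first find an estimator of $\delta$ for an aging machine given $\Pi_0$. An MLE (Maximum Likelihood Estimation) method is proposed here. The problem is to find a good estimator $\hat{\delta}$ given observed values of a random sample $x_1, x_2, \ldots, x_n$ (Hogg and Tanis 1996). Now we assume there are $n$ independent failure events that occurred at times $t_i$, $i = 1, 2, \ldots, n$. For each failure event, say the failure at $t_i$, denote the path of health states leading to the failure by $S(t_i)$. Let the failure occur between the $k_i$th and $(k_i + 1)$th samples; i.e. $k_i \Delta \le t_i < (k_i + 1)\Delta$, where $\Delta$ denotes the fixed time interval between two consecutive health observations. Thus, the probability that $S(t_i)$ takes a particular path $s(t_i)$ is

$$
\begin{aligned}
\Pr\left[T = t_i,\ S(t_i) = s(t_i) \mid \hat{\delta}\right]
&= \Pr\left[k_i \Delta \le t_i < (k_i + 1)\Delta,\ S_1 = s^{(i)}_1, \ldots, S_{k_i} = s^{(i)}_{k_i} \mid \hat{\delta}\right] \\
&= \Pr\left[S_1 = s^{(i)}_1, \ldots, S_{k_i} = s^{(i)}_{k_i},\ S_{k_i + 1} = F \mid \hat{\delta}\right] \\
&= \Pr\left[s^{(i)}_{k_i + 1} = F \mid s^{(i)}_{k_i}, \hat{\delta}\right] \prod_{m=1}^{k_i} \Pr\left[s^{(i)}_m \mid s^{(i)}_{m-1}, \hat{\delta}\right]
\end{aligned}
$$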

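where $S^{(i)}_j$, $j = 1, 2, \ldots, k_i$, is the $j$th state in the particular path $S(t_i)$ leading to the failure state $F$ at $t_i$. To rearrange it,

$$
\Pr\left[S(t_i) = s(t_i) \mid \hat{\delta}\right] = \prod_{m=1}^{k_i + 1} P_{s^{(i)}_{m-1},\, s^{(i)}_m \mid \hat{\delta}} \tag{8}
$$

where $P_{s^{(i)}_{m-1},\, s^{(i)}_m \mid \hat{\delta}}$ denotes the transition probability that the state changes from $s^{(i)}_{m-1}$ to $s^{(i)}_m$ given that the aging factor $\delta = \hat{\delta}$. With $n$ failure sample paths, the likelihood function can be written as follows:

$$
L = \prod_{i=1}^{n} \Pr\left(S(t_i)\right) = \prod_{i=1}^{n} \prod_{m=1}^{k_i + 1} P_{s^{(i)}_{m-1},\, s^{(i)}_m \mid \hat{\delta}} \tag{9}
$$

and the log-likelihood function is:

$$
\log L = \sum_{i=1}^{n} \sum_{m=1}^{k_i + 1} \log P_{s^{(i)}_{m-1},\, s^{(i)}_m \mid \hat{\delta}} \tag{10}
$$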

Hence, the estimator of $\delta$ can be found by maximizing the log-likelihood function. That is:

$$
\hat{\delta} = \arg\max_{\hat{\delta}} \{\log L\} = \arg\max_{\hat{\delta}} \left\{ \sum_{i=1}^{n} \sum_{m=1}^{k_i + 1} \log P_{s^{(i)}_{m-1},\, s^{(i)}_m \mid \hat{\delta}} \right\}. \tag{11}
$$

Since $\delta \in (0, 1)$, numerical methods to search over the range (0, 1) can be employed to find the optimal $\hat{\delta}$. In figure 8, $\log L$ as a function of $\hat{\delta}$ for $\delta$ = 0.002, 0.008 and 0.01 is illustrated. Figure 8a shows the log-likelihood function for $0 \le \hat{\delta} \le 1$, while figure 8b shows the function in a smaller interval: $0 \le \hat{\delta} \le 0.04$. From figure 8, it can be conjectured, although not rigorously proven, that the log-likelihood function is a unimodal function of $\hat{\delta}$ and thus $\hat{\delta}$ can be found rather effectively through numerical search.
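The numerical search can be sketched as a plain grid search over candidate aging factors. The code below is a minimal illustration rather than the authors' implementation: it assumes that the m-th transition of a path uses the initial matrix aged m−1 times by the row update of equations (7a) and (7b), and the matrix, the failure paths and the helper names are all hypothetical.

```python
import math

def age_row(row, delta):
    """One application of equations (7a)/(7b) to a transition-matrix row."""
    M = max(range(len(row)), key=lambda j: row[j])   # position of the peak
    p_left, p_right = sum(row[:M]), sum(row[M:])
    if p_left == 0.0:
        return list(row)
    shrink = 1.0 - delta * p_left / p_right
    return [p * (1.0 + delta) if j < M else p * shrink
            for j, p in enumerate(row)]

def log_likelihood(paths, P0, delta):
    """log L of equation (10); each path lists state indices and ends in F (index 0).

    Assumption: the m-th transition of a path uses the matrix aged m-1 times.
    """
    longest = max(len(p) for p in paths) - 1         # most transitions in a path
    matrices = [P0]
    for _ in range(longest - 1):
        matrices.append([age_row(r, delta) for r in matrices[-1]])
    total = 0.0
    for path in paths:
        for m in range(1, len(path)):
            p = matrices[m - 1][path[m - 1]][path[m]]
            if p <= 0.0:
                return float("-inf")                 # path impossible for this delta
            total += math.log(p)
    return total

def estimate_delta(paths, P0, grid):
    """Grid search for the aging factor that maximises log L."""
    return max(grid, key=lambda d: log_likelihood(paths, P0, d))

# Hypothetical 4-state matrix (F, 1, 2, 3) and three observed failure paths.
P0 = [[1.0, 0.0, 0.0, 0.0],
      [0.10, 0.60, 0.20, 0.10],
      [0.05, 0.15, 0.60, 0.20],
      [0.02, 0.08, 0.20, 0.70]]
paths = [[3, 3, 2, 3, 2, 2, 1, 1, 0], [3, 2, 2, 1, 0], [3, 3, 3, 2, 1, 1, 0]]
grid = [i / 1000 for i in range(1, 101)]             # candidates 0.001 ... 0.1
delta_hat = estimate_delta(paths, P0, grid)
print(delta_hat, log_likelihood(paths, P0, delta_hat))
```

If the log-likelihood is indeed unimodal, as conjectured from figure 8, a golden-section or bisection search would need far fewer evaluations than the grid; the grid is used here only because it makes no such assumption.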


3.4 Example

A numerical example is used to illustrate the construction of the aging Markovian prognosis model. Suppose H < 70 (or MCI < 1.33) is regarded as an absolutely dangerous condition for the equipment. That is, when the health index has a score of less than 70, the machine is deemed to be in a failure state. We now divide H = [0, 100] into 11 states as shown in table 2.

Figure 8. $\log L$ vs. $\hat{\delta}$ for $\delta$ = 0.002, 0.008 and 0.01: (a) $0 \le \hat{\delta} \le 1$; (b) $0 \le \hat{\delta} \le 0.04$.

An initial transition probability matrix, $\Pi_0$, representing the stable equipment health stochastic process, is assumed to be:

$$
\Pi_0 =
\begin{array}{c|ccccccccccc}
 & F & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10 \\ \hline
F & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 0.05 & 0.35 & 0.23 & 0.17 & 0.1 & 0.05 & 0.031 & 0.01 & 0.005 & 0.003 & 0.001 \\
2 & 0.01 & 0.1 & 0.35 & 0.2 & 0.14 & 0.1 & 0.05 & 0.032 & 0.01 & 0.005 & 0.003 \\
3 & 0.005 & 0.01 & 0.105 & 0.35 & 0.2 & 0.14 & 0.09 & 0.05 & 0.035 & 0.01 & 0.005 \\
4 & 0.003 & 0.005 & 0.01 & 0.092 & 0.35 & 0.2 & 0.14 & 0.1 & 0.05 & 0.04 & 0.01 \\
5 & 0.001 & 0.003 & 0.005 & 0.01 & 0.091 & 0.35 & 0.2 & 0.14 & 0.1 & 0.07 & 0.03 \\
6 & 5\times10^{-4} & 0.001 & 0.003 & 0.005 & 0.01 & 0.0805 & 0.35 & 0.25 & 0.17 & 0.08 & 0.05 \\
7 & 2\times10^{-4} & 6\times10^{-4} & 0.001 & 0.003 & 0.005 & 0.01 & 0.0802 & 0.47 & 0.25 & 0.1 & 0.08 \\
8 & 1\times10^{-4} & 3\times10^{-4} & 6\times10^{-4} & 0.001 & 0.003 & 0.005 & 0.01 & 0.08 & 0.6 & 0.2 & 0.1 \\
9 & 8\times10^{-5} & 1.2\times10^{-4} & 3\times10^{-4} & 5\times10^{-4} & 0.001 & 0.003 & 0.005 & 0.01 & 0.08 & 0.6 & 0.3 \\
10 & 2\times10^{-5} & 8\times10^{-5} & 1\times10^{-4} & 3\times10^{-4} & 5\times10^{-4} & 0.001 & 0.003 & 0.005 & 0.01 & 0.08 & 0.9
\end{array}
$$

Using this $\Pi_0$ and $\delta$ = 0.005, we simulate 20 sample paths of health index scores, each of length 90. Figure 9 illustrates the first 5 paths.

As we can see, the health scores roughly decrease over time. We use these 20 sample paths as sample data to estimate $\delta$. To evaluate the performance of this maximum likelihood estimate (MLE) of the aging factor, we repeat the simulation and estimation five times and obtain an average estimate $\hat{\delta}$ = 0.00504 with a quadratic estimation loss of 1.064e−07 from the simulated data. In table 3, different values of $\delta$ are used to simulate new sample paths and the same estimation method is applied to obtain the maximum likelihood estimate of $\delta$. It can be seen that with 20 sample paths and 90 sample scores in each path, the MLE is quite robust for different values of $\delta$ (0.01–0.0005).

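We now estimate the initial transition probabilities in $\Pi_0$. As shown below, there are $n \times n$ entries in the matrix.

$$
\Pi_0 =
\begin{array}{c|ccccc}
 & F & 1 & 2 & \cdots & n-1 \\ \hline
F & P_{F,F} & P_{F,1} & P_{F,2} & \cdots & P_{F,n-1} \\
1 & P_{1,F} & P_{1,1} & P_{1,2} & \cdots & \vdots \\
2 & \vdots & & & \ddots & \vdots \\
\vdots & & & & & P_{n-2,n-1} \\
n-1 & P_{n-1,F} & P_{n-1,1} & P_{n-1,2} & \cdots & P_{n-1,n-1}
\end{array}
$$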

Table 2. Health index and its corresponding states (example).

| Health score | State | Health score | State |
|---|---|---|---|
| <70 | Failure | 85–87 | 6 |
| 70–72 | 1 | 88–90 | 7 |
| 73–75 | 2 | 91–93 | 8 |
| 76–78 | 3 | 94–96 | 9 |
| 79–81 | 4 | 97–100 | 10 |
| 82–84 | 5 | | |


Intuitively, it is not hard to obtain the estimators. Denote $n_{i,j}$ as the number of one-step transitions from $i$ to $j$ in the available sample paths. The estimators of $P_{i,j}$ are given by $\hat{P}_{ij} = n_{i,j}/n_i$, where $n_i = \sum_{j \in S} n_{i,j}$ is the total number of occurrences of state $i$ in the sample paths. Ideally, $\Pi_0$ should be estimated using sample paths of machines before aging, i.e. machines with a constant hazard rate during the stable-reliability period of the well-known bathtub reliability model. However, if aging starts early, the sample paths available for estimating the initial transition probabilities may be limited. Given an aging sample path and $\delta$, we therefore need an estimate of $\Pi_0$. Let $m(l)$, $l = 1, \ldots, n_i$, be the machine ages at which the health state is $i$ in an aging sample path. From equation (7a), we can obtain the conditional expected number of transitions from state $i$ to state $j$ for $j < M$:

$$E(\text{number of transitions } i \to j \mid s_{m(1)} = i, s_{m(2)} = i, \ldots, s_{m(n_i)} = i) = \sum_{l=1}^{n_i} P_{ij}(0)(1+\delta)^{m(l)} \quad \text{for } j < M.$$

Figure 9. Five sequences of health state sample path.

Table 3. Estimation performance for different $\delta$.

| Actual $\delta$ | Average of 5 $\hat{\delta}$ | Sample mean squared estimation error |
|---|---|---|
| 0.01 | 0.00946 | 3.54e−7 |
| 0.005 | 0.00504 | 1.06e−7 |
| 0.001 | 0.00108 | 1.06e−8 |


Therefore, the initial left-hand-side transition probabilities can be estimated by:

$$\hat{P}_{ij} = \frac{n_{i,j}}{\sum_{l=1}^{n_i} (1+\delta)^{m(l)}} \quad \text{for } j < M. \qquad (12)$$
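The counting estimator $n_{i,j}/n_i$ and the age-corrected estimator of equation (12) can be sketched as follows. This is a minimal illustration that assumes each sample path is a list whose entry at position $m$ is the state observed at age $m$ (starting at $m = 0$); the function names and the tiny sample paths are hypothetical.

```python
from collections import defaultdict

def transition_counts(paths):
    """Return n[i][j], the one-step transition counts, and ages[i], the ages m at
    which state i was observed together with a following observation."""
    n = defaultdict(lambda: defaultdict(int))
    ages = defaultdict(list)
    for path in paths:
        for m in range(len(path) - 1):
            i, j = path[m], path[m + 1]
            n[i][j] += 1
            ages[i].append(m)
    return n, ages

def stationary_estimate(n, i, j):
    """P_hat_ij = n_ij / n_i, with n_i the number of observed transitions out of i."""
    return n[i][j] / sum(n[i].values())

def aging_left_estimate(n, ages, i, j, delta):
    """Equation (12): initial left-hand-side probability (meaningful for j < M only)."""
    return n[i][j] / sum((1.0 + delta) ** m for m in ages[i])

paths = [[2, 2, 1, 0], [2, 1, 1, 0]]    # made-up state sequences, 0 standing for F
n, ages = transition_counts(paths)
print(stationary_estimate(n, 2, 1))
print(aging_left_estimate(n, ages, 2, 1, delta=0.1))
```

With $\delta = 0$ the denominator of equation (12) reduces to $n_i$, so the two estimators coincide; a positive $\delta$ inflates the denominator and therefore lowers the estimate of the initial probability.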

The left-hand-side and right-hand-side cumulated probabilities can then be estimated by:

$$\hat{P}^L_i(0) = \sum_{j=F}^{M-1} \hat{P}_{i,j}; \quad \hat{P}^L_i(m) = \hat{P}^L_i(0)(1+\delta)^m; \quad \text{and} \quad \hat{P}^R_i(m) = 1 - \hat{P}^L_i(m). \qquad (13)$$

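From equation (7b), we can obtain the conditional expected number of transitions from state $i$ to state $j$ for $j \ge M$:

$$E(\text{number of transitions } i \to j \mid s_{m(1)} = i, s_{m(2)} = i, \ldots, s_{m(n_i)} = i) = \sum_{l=1}^{n_i} P_{ij}(0) \prod_{h=1}^{m(l)} \left[1 - \delta \frac{P^L_i(h)}{P^R_i(h)}\right] \quad \text{for } j \ge M,$$

and thus we obtain the estimates for the initial right-hand-side transition probabilities:

$$\hat{P}_{ij} = \frac{n_{i,j}}{\sum_{l=1}^{n_i} \prod_{h=1}^{m(l)} \left[1 - \delta \frac{\hat{P}^L_i(h)}{\hat{P}^R_i(h)}\right]} \quad \text{for } j \ge M. \qquad (14)$$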

With equations (11)–(14), we propose the following iterative procedure to estimate $\delta$ and $P_{ij}$ simultaneously:

Step 1. Give an initial guess of $\hat{\delta}$.

Step 2. Use equations (12)–(14) to estimate $\hat{P}_{ij}$ given the latest $\hat{\delta}$.

Step 3. Use equation (11) to re-estimate $\hat{\delta}$ given the $\hat{P}_{ij}$ estimated in Step 2.

Step 4. Check whether $\hat{\delta}$ has converged to a predetermined accuracy. If yes, stop. If not, go back to Step 2.

Since this paper’s focus is not on the estimation quality, the convergence of the above iterative procedure is not proven and should be an interesting subject for future study.
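The control flow of Steps 1 to 4 can be sketched as a simple alternating loop. The two estimation routines are passed in as callables; the toy callables used here are hypothetical stand-ins chosen only so that the loop has a known fixed point, and they are not equations (11)–(14). As noted above, convergence is not guaranteed in general, so the sketch also caps the number of iterations.

```python
def alternate_estimation(estimate_p, estimate_delta, delta0, tol=1e-8, max_iter=100):
    """Alternate between the two estimation steps until delta_hat stabilizes."""
    delta = delta0                               # Step 1: initial guess
    for iteration in range(1, max_iter + 1):
        p_hat = estimate_p(delta)                # Step 2: P_hat given latest delta_hat
        new_delta = estimate_delta(p_hat)        # Step 3: delta_hat given P_hat
        if abs(new_delta - delta) < tol:         # Step 4: convergence check
            return new_delta, p_hat, iteration
        delta = new_delta
    raise RuntimeError("delta_hat did not converge within max_iter iterations")

# Hypothetical stand-ins: the composed update is delta -> 0.25 * delta + 0.01,
# a contraction whose fixed point is 0.04 / 3.
def toy_estimate_p(delta):
    return 0.5 * delta

def toy_estimate_delta(p_hat):
    return 0.5 * p_hat + 0.01

delta_hat, p_hat, n_iter = alternate_estimation(toy_estimate_p, toy_estimate_delta, delta0=0.5)
print(delta_hat, n_iter)
```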

4. Dynamic preventive maintenance policy

The ultimate goal of this research is to develop a dynamic PM policy based on real-time equipment data. The main tasks of establishing a dynamic PM scheme are illustrated in figure 10. The first and second rectangles have been developed in sections 2 and 3, respectively. Our next task focuses on the third rectangle, that is, to develop an efficient dynamic PM policy. Here, we will introduce a dynamic PM scheme based on the health index and the equipment’s condition prognosis model. Then, we will explain why it is an efficient method. How the dynamic PM methodology can be applied in practice will also be presented at the end of this section.

4.1 Preventive maintenance optimization models

In this paper, we consider only full-scale PMs that completely renew the equipment condition. Generally, when planning a preventive maintenance policy, the goal is to achieve two aspects of production efficiency: minimum equipment maintenance cost or minimum equipment downtime. Conventional PM models are all developed to minimize the equipment maintenance cost or downtime. Nevertheless, these conventional models (e.g. chapter 9 of Elsayed 1996) do not take into consideration the real-time condition of the equipment. PM policies are determined only based on the machine’s earlier failure records. Modern sensor technology has made real-time equipment operation data available for monitoring and analysis. A more timely, adaptive PM decision can be now made in real time according to the on-line collected equipment data. This is why the word ‘dynamic’ is used, i.e. to perform the maintenance dynamically based on the equipment’s current condition to minimize the maintenance cost or the machine’s downtime ratio.

4.1.1 Cost minimization model. When minimizing the cost, we often take the long-term average of total maintenance cost as the objective cost function to be minimized. The main maintenance cost of the equipment comes from the preventive maintenance and the breakdown repair. The cost of a failure repair is usually much higher than that of performing a PM. Sufficient PMs can reduce the possibility of equipment breakdown but increase the PM cost. Therefore, a trade-off exists between the PM cost and the failure cost. A metric to evaluate the efficiency of a PM policy is the expected maintenance cost per unit time.

[Figure 10: block diagram. Real-time equipment data feed Equipment Condition Evaluation (Health Index), followed by Equipment Condition Prognosis and the Dynamic PM Scheme, which outputs the PM decision.]

Suppose the equipment data are acquired at scheduled sampling times $\Delta, 2\Delta, 3\Delta, \ldots, m\Delta, \ldots$, where $\Delta$ is a constant time interval. At these specific time points, a PM decision (to perform a PM or do nothing) is made. Since the health state transition probabilities are age-dependent, $\Pi_{m+1}$ must be calculated from $\Pi_m$ using the aging Markovian model presented in section 3.2. Notice that our transition probability matrix is time dependent, so the health process is a nonstationary Markov chain. That is, the equipment’s health has a tendency to worsen as it becomes older.

For a stationary Markov chain, the $n$-step transition probabilities $\Pi^{(n)}$ can be obtained from the $n$th power of the one-step transition probability matrix, i.e. $\Pi^{(n)} = \Pi^n$. But in the aging Markovian prognosis model, the $n$-step transition probabilities are different at different observation points. At any given period $m$, the $n$-step transition probabilities are calculated by

$$\Pi^{(n)}_m = \Pi_m \Pi_{m+1} \cdots \Pi_{m+n-1}. \qquad (15)$$

We can now determine the expected cost per unit time, denoted by $\mu$, based on the renewal theory. If we decide to perform a PM after $k$ periods (i.e. PM at $(m + k)\Delta$) as shown in figure 11, given a health index state $s$ at the $m$th period, the expected cost per unit time is

$$E(\mu(m,k) \mid s) = \frac{\text{expected cost per cycle}}{\text{expected cycle time}} = \frac{C + K \Pr\{k\Delta > T(m) \mid s\}}{E[m\Delta + \min(k\Delta, T(m)) \mid s]} \qquad (16)$$

where $C$ is the cost of performing a PM and $K$ is the additional cost for a breakdown repair, i.e. the cost of a breakdown repair is $C + K$. $T(m)$ is the time to failure (TTF) from the current time period $m$. $\Pr\{k\Delta > T(m) \mid s\}$ is the probability that the equipment breaks down before the PM planned $k$ periods ahead is performed, given that the current state is $s$. A maintenance cycle is terminated by either the PM or the equipment breakdown. Therefore, the numerator in equation (16) represents the expected cost per cycle and the denominator $E[m\Delta + \min(k\Delta, T(m)) \mid s]$ denotes the expected cycle length. Our objective is then to find an optimal $k$ at any given period $m$ that minimizes $E(\mu(m,k) \mid s)$.

In order to calculate the expected cycle length, we again use the Markov chain theory. Define the following two variables:

$f^{(n)}_{ij}(m)$: the probability that the equipment health goes from state $i$ at period $m$ to state $j$ for the first time at period $m + n$;

$P^{(n)}_{ij}(m)$: the probability that the equipment health state starts from $i$ at period $m$ and becomes $j$ at period $m + n$.

[Figure 11: time line from the previous PM at time 0 through $\Delta$, $2\Delta$, ..., $m\Delta$ to $(m+k)\Delta$, the time at which the PM is to be performed, $k\Delta$ after the current time.]

If $F$ represents the state of machine failure and is therefore an absorbing state, it can easily be found that

$$f^{(n)}_{iF}(m) = P^{(n)}_{iF}(m) - P^{(n-1)}_{iF}(m) \quad \text{for } n > 1, \qquad f^{(1)}_{iF}(m) = P^{(1)}_{iF}(m) \quad \text{for } n = 1,$$

where $P^{(n)}_{iF}(m)$ is an entry in $\Pi^{(n)}_m$. From the above definitions, we can immediately obtain:

$$\Pr\{k\Delta > T(m) \mid s\} = P^{(k)}_{sF}(m).$$
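The matrix product of equation (15) and the first-failure probabilities defined above can be computed directly. The sketch below uses plain Python lists and a hypothetical three-state chain (F, 1, 2) whose failure probability grows with age; `toy_one_step` is an illustrative stand-in and not the aging model of section 3.2.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def n_step_matrix(one_step, m, n):
    """Equation (15): Pi_m^(n) = Pi_m Pi_{m+1} ... Pi_{m+n-1}; one_step(h) returns Pi_h."""
    size = len(one_step(m))
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for h in range(m, m + n):
        result = matmul(result, one_step(h))
    return result

def first_failure_probs(one_step, m, s, k, fail=0):
    """Return P^(l)_{sF}(m) for l = 0..k and f^(l)_{sF}(m) for l = 1..k."""
    cumulative = [n_step_matrix(one_step, m, l)[s][fail] for l in range(k + 1)]
    first = [cumulative[l] - cumulative[l - 1] for l in range(1, k + 1)]
    return cumulative, first

def toy_one_step(h, p0=0.05, growth=0.01):
    """Hypothetical age-dependent chain; state 0 is the absorbing failure state F."""
    p = min(1.0, p0 * (1.0 + growth) ** h)
    return [[1.0, 0.0, 0.0],
            [p, 1.0 - p, 0.0],
            [p / 2, 0.3, 0.7 - p / 2]]

cumulative, first = first_failure_probs(toy_one_step, m=5, s=2, k=4)
print(cumulative)   # P^(0), ..., P^(4): probability of having failed within l periods
print(first)        # f^(1), ..., f^(4): probability of first failure at exactly l periods
```

Because F is absorbing, the first-failure probabilities sum to the cumulative failure probability, which is the identity used in the text.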

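Suppose we plan a PM after $k$ periods given the current state $s$ at period $m$. $E[m\Delta + \min(k\Delta, T(m)) \mid s]$ represents the expected cycle length. Measured from the current period, the cycle lasts $k\Delta$ if it is ended by a PM and $T(m)$ if it is ended by an equipment breakdown. The probability that the equipment is down after $k$ periods is $P^{(k)}_{sF}(m)$, which can be obtained from $\Pi^{(k)}_m$. We can derive:

$$
\begin{aligned}
E[m\Delta + \min(k\Delta, T(m)) \mid s]
&= m\Delta + k\Delta \Pr\{T(m) > k\Delta \mid s\} \\
&\quad + \Delta \Pr\{\text{equipment fails for the first time after 1 period} \mid s\} + \cdots \\
&\quad + k\Delta \Pr\{\text{equipment fails for the first time after } k \text{ periods} \mid s\} \\
&= m\Delta + k\Delta \left(1 - P^{(k)}_{sF}(m)\right) + \Delta f^{(1)}_{sF}(m) + 2\Delta f^{(2)}_{sF}(m) + \cdots + k\Delta f^{(k)}_{sF}(m) \\
&= m\Delta + k\Delta \left(1 - P^{(k)}_{sF}(m)\right) + \Delta P^{(1)}_{sF}(m) + 2\Delta \left(P^{(2)}_{sF}(m) - P^{(1)}_{sF}(m)\right) + \cdots + k\Delta \left(P^{(k)}_{sF}(m) - P^{(k-1)}_{sF}(m)\right) \\
&= m\Delta + k\Delta - \Delta \sum_{l=1}^{k-1} P^{(l)}_{sF}(m) \quad \text{for } k > 1,
\end{aligned}
$$

and

$$E[m\Delta + \min(k\Delta, T(m)) \mid s] = (m + 1)\Delta \quad \text{for } k = 1. \qquad (17)$$

Here, we impose that the decisions (PM, breakdown repair, or do nothing) are made and the equipment failures are observable only at discrete time points $\Delta, 2\Delta, 3\Delta, \ldots, m\Delta, \ldots$.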

We have determined both the numerator and the denominator in equation (16). Given the current state $s$ at time $m\Delta$, if a PM is planned after $k$ periods, the expected cost per unit time now becomes:

$$E(\mu(m,k) \mid s) = \frac{C + K P^{(k)}_{sF}(m)}{m\Delta + \Delta \left(k - \sum_{l=0}^{k-1} P^{(l)}_{sF}(m)\right)} \quad \text{for } k \ge 1, \qquad (18)$$

where we let $P^{(0)}_{sF}(m) = 0$. Equation (18) is then the objective cost function to be used to determine the optimal PM policy. It should be noted that a cost function commonly used in machining economics problems (see, for example, Kaspi and Shabtay 2003), which calculates the expected cost per manufactured item instead of the expected cost per unit time, could also be formulated as the objective function. In the formulation of the expected cost per item, the maintenance cost becomes the manufacturing overhead in addition to the unit manufacturing cost. It can be easily shown that minimizing the expected cost per item is equivalent to minimizing equation (18) when the product quality cost is not considered and the processing time per item and the manufacturing cost per item are both constant.
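As a small numerical illustration of equation (18), the sketch below evaluates the expected cost per unit time from a list of cumulative failure probabilities $P^{(l)}_{sF}(m)$, $l = 0, \ldots, k$, and picks the minimizing $k$. The probabilities, costs and function names are made up for illustration, and `dt` stands for the sampling interval $\Delta$.

```python
def expected_cost_rate(fail_probs, m, k, C, K, dt=1.0):
    """Equation (18). fail_probs[l] = P^(l)_{sF}(m) for l = 0..k, with fail_probs[0] = 0."""
    cycle = m * dt + dt * (k - sum(fail_probs[:k]))   # expected cycle length, equation (17)
    return (C + K * fail_probs[k]) / cycle

def optimal_k(fail_probs, m, C, K, dt=1.0):
    """Return the k in 1..len(fail_probs)-1 with the lowest expected cost per unit time."""
    return min(range(1, len(fail_probs)),
               key=lambda k: expected_cost_rate(fail_probs, m, k, C, K, dt))

fail_probs = [0.0, 0.001, 0.003, 0.2]   # hypothetical P^(0), ..., P^(3)
for k in (1, 2, 3):
    print(k, expected_cost_rate(fail_probs, m=10, k=k, C=1.0, K=5.0))
print("best k:", optimal_k(fail_probs, m=10, C=1.0, K=5.0))
```

In this made-up case, postponing the PM by one more period spreads the PM cost over a longer cycle, but postponing it further runs into the sharp rise of the failure probability, so the minimum is attained at $k = 2$.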

4.1.2 Downtime minimization model. Cost estimation may be an issue for the cost minimization model. The downtime minimization model then serves as an alternative for obtaining the dynamic PM policy. There are two types of equipment downtime: downtime for performing PM and downtime for breakdown repair. The downtime for breakdown repair is usually much longer than that for performing a PM. Similar to the cost minimization, we should determine an optimal PM schedule by minimizing the long-term average downtime fraction. Let $\rho$ denote the fraction of equipment downtime.

$$E(\rho) = \frac{\text{Total expected downtime per cycle}}{\text{Expected cycle length}} \qquad (19)$$

Equation (19) expresses the expected downtime fraction. Assume $R$ is the downtime when performing a PM and $D$ is the additional idle time needed for a breakdown repair (i.e. the total downtime due to the machine’s breakdown is $R + D$). Similar to equation (18), the numerator in (19) can easily be found to be $R + D \cdot P^{(k)}_{sF}(m)$. The expected cycle length is, however, different from that of the cost model in section 4.1.1. In a cost minimization model, the downtime due to PM or breakdown repair is usually neglected.

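As shown in figure 12, the PM and breakdown repair downtimes should now be taken into consideration in the expected cycle length. Suppose we are at time period $m$ with the health state equal to $s$ and a PM is planned to be performed after $k$ periods.

[Figure 12: time line of maintenance cycles, each made up of uptime and downtime and ended by either a PM or a failure repair.]

The expected cycle length is therefore:

$$
\begin{aligned}
E[m\Delta + \min(k\Delta + R,\, T(m) + R + D) \mid s]
&= m\Delta + (k\Delta + R) \Pr\{T(m) + R + D > k\Delta + R \mid s\} \\
&\quad + (\Delta + R + D) \Pr\{\text{equipment fails for the first time after 1 period} \mid s\} + \cdots \\
&\quad + (k\Delta + R + D) \Pr\{\text{equipment fails for the first time after } k \text{ periods} \mid s\} \\
&= m\Delta + (k\Delta + R)\left(1 - P^{(k)}_{sF}(m)\right) + (\Delta + R + D) f^{(1)}_{sF}(m) + (2\Delta + R + D) f^{(2)}_{sF}(m) + \cdots + (k\Delta + R + D) f^{(k)}_{sF}(m) \\
&= m\Delta + (k\Delta + R)\left(1 - P^{(k)}_{sF}(m)\right) + (\Delta + R + D) P^{(1)}_{sF}(m) + (2\Delta + R + D)\left(P^{(2)}_{sF}(m) - P^{(1)}_{sF}(m)\right) + \cdots + (k\Delta + R + D)\left(P^{(k)}_{sF}(m) - P^{(k-1)}_{sF}(m)\right) \\
&= m\Delta + R + k\Delta - \Delta \sum_{l=1}^{k-1} P^{(l)}_{sF}(m) + D P^{(k)}_{sF}(m) \quad \text{for } k > 1,
\end{aligned}
$$

and

$$E[m\Delta + \min(k\Delta + R,\, T(m) + R + D) \mid s] = (m + 1)\Delta + R + D P_{sF}(m) \quad \text{for } k = 1. \qquad (20)$$

Therefore, given the current state $s$ and the current age $m\Delta$, the expected downtime fraction is given by equation (21) if a PM is planned after $k$ periods.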

$$E(\rho(m,k) \mid s) = \frac{R + D P^{(k)}_{sF}(m)}{m\Delta + R + \Delta \left(k - \sum_{l=0}^{k-1} P^{(l)}_{sF}(m)\right) + D P^{(k)}_{sF}(m)} \quad \text{for } k \ge 1. \qquad (21)$$

In addition, the machine’s expected utilization rate can be expressed as $1 - E[\rho(m,k) \mid s]$.
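Equation (21) can be read as expected downtime divided by expected uptime plus expected downtime, which the following sketch makes explicit. The failure probabilities are hypothetical, `dt` stands for $\Delta$, and the function names are illustrative only.

```python
def expected_downtime_fraction(fail_probs, m, k, R, D, dt=1.0):
    """Equation (21). fail_probs[l] = P^(l)_{sF}(m) for l = 0..k, with fail_probs[0] = 0."""
    downtime = R + D * fail_probs[k]                   # numerator of equation (19)
    uptime = m * dt + dt * (k - sum(fail_probs[:k]))   # expected running time in the cycle
    return downtime / (uptime + downtime)

def expected_utilization(fail_probs, m, k, R, D, dt=1.0):
    """Expected utilization rate, 1 - E[rho(m, k) | s]."""
    return 1.0 - expected_downtime_fraction(fail_probs, m, k, R, D, dt)

fail_probs = [0.0, 0.001, 0.003, 0.2]   # hypothetical P^(0), ..., P^(3)
for k in (1, 2, 3):
    print(k, expected_downtime_fraction(fail_probs, m=50, k=k, R=10.0, D=100.0))
```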

4.2 Optimal PM decisions

The score of the health index reflects the condition of the equipment’s health. In order to determine when the machine needs a PM given a real-time score of the health index, we should find a threshold for the health index scores. When the observed score falls to or below this threshold, a decision is made to perform a PM at the next available time.

Suppose that

• the current time, i.e. the machine age, is $m\Delta$;

• the current state is $s$ (i.e. the health index corresponds to state $s$);

• a planned PM is scheduled to be performed after $k$ periods of time (i.e. PM at time $(m + k)\Delta$).

The expected cost per unit time is $E[\mu(m,k) \mid s]$ and the expected downtime fraction is $E[\rho(m,k) \mid s]$, which are calculated from equations (18) and (21), respectively.

Given the time point $m\Delta$ and the health index state $s$, there exists $k^*(s, m)$ that minimizes $E[\mu(m,k) \mid s]$ or $E[\rho(m,k) \mid s]$.

$$E(\mu(m, k^*) \mid s) = \min_k E(\mu(m,k) \mid s) \quad \text{or} \quad E(\rho(m, k^*) \mid s) = \min_k E(\rho(m,k) \mid s) \qquad (22)$$

Since we want to minimize the expected cost per unit time, the minimum-cost PM decision is as follows. If the minimum cost appears at $k^*(s, m) > 1$, planning the PM $k^*(s, m) > 1$ periods ahead attains a lower average cost, so we should not perform a PM at the next decision-making time. If the minimum cost appears at $k^*(s, m) = 1$, we should perform a PM right away. This rule will be used to construct a dynamic PM policy.

If the state space is of size 11, i.e. $S = \{F, s_1, s_2, \ldots, s_{10}\}$, there will be ten $k^*$ values, $k^*(s_1, m), k^*(s_2, m), \ldots, k^*(s_{10}, m)$, at time $m\Delta$, one for each state in $S$ except for the failure state $F$. Suppose a larger value of $s_i$ indicates a healthier equipment condition; $k^*(s_i, m)$ is then a nondecreasing function of $s_i$. Let

$$s^*(m) = \max\{s_i : k^*(s_i, m) = 1\}. \qquad (23)$$

That is, $s^*(m)$ denotes the healthiest equipment state among the equipment states that satisfy $k^*(s_i, m) = 1$ at time $m\Delta$. Then, $s^*(m)$ is the threshold for performing the PM. When a health index is calculated at time $m\Delta$, the score is first converted to a state value (e.g. table 2). The state value is then compared to the threshold state $s^*(m)$. If the observed state value is no higher than $s^*(m)$ at time $m\Delta$, the decision will be to perform the PM right away.

[Figure 13: flow chart for constructing the PM alarm boundary. Starting at $t = 1$, calculate $E[\mu(m,k) \mid s]$ or $E[\rho(m,k) \mid s]$. Step 1: find $k^*(t, s_i)$ for all $s_i$. Step 2: calculate $s^*(t) = \max\{s_i : k^*(s_i, t) = 1\}$. If the time horizon through $t$ is not yet long enough, set $t = t + 1$ and repeat; otherwise, Step 3: form the boundary $s^*(1), s^*(2), s^*(3), \ldots, s^*(t), \ldots$]

The above decision is made at time $m\Delta$. Given any time point $t$, there exists a threshold state $s^*(t)$. A sequence of threshold states $s^*(t)$, $t = 1, 2, 3, \ldots$, can then be calculated at time points $\Delta, 2\Delta, \ldots, t\Delta, \ldots$ and forms a PM alarm boundary. Figure 13 shows the steps in a flow chart.

The PM alarm boundary is actually the least tolerable value of the equipment health. That is, any observed health state lower than this boundary will be regarded as an alarming situation where a PM is urgently needed.

Figure 14 shows a typical PM alarm boundary in dotted line and a sample path of actual health state observations in solid line. As can be seen, dynamic PM decisions can be easily made by monitoring the trend of the actual health index observations. Once the trend goes below the alarm boundary, a PM should be performed.
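The monitoring rule can be sketched as a small routine that converts each observed health score to a state using the bins of table 2 and compares it with the alarm boundary. The sketch assumes that each bin covers three points starting at 70 (so 70 up to but not including 73 is state 1, and so on), and that an observation at or below the threshold state triggers a PM, in line with definition (23); the scores, the boundary values and the function names are made up.

```python
def health_state(score):
    """Map a health index score to a state as in table 2 (0 stands for failure F)."""
    if score < 70:
        return 0
    return min(10, int((score - 70) // 3) + 1)

def first_alarm(scores, boundary):
    """Index of the first period whose state is at or below the threshold, else None.
    A failure (state 0) also raises the alarm, although it calls for a repair."""
    for t, (score, threshold) in enumerate(zip(scores, boundary)):
        if health_state(score) <= threshold:
            return t
    return None

scores = [98, 93, 88, 84, 80]      # hypothetical health index observations
boundary = [3, 3, 4, 4, 4]         # hypothetical threshold states s*(t)
print(first_alarm(scores, boundary))
```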

4.3 Example

In this section, we use an example to explain how to calculate the alarm boundary from the equipment condition prognosis model. The prognosis model with an aging factor $\delta$ = 0.025 presented in the example of section 3 is used. Take the downtime minimization model for example. We can calculate the expected downtime fraction for different sets of values of $k$, $m$, and $s$ using equation (21).

Table 4 lists the expected values of downtime fractions (%) ($R$ = 10 and $D$ = 100) for $t$ = 51 and 52. For $t$ = 51 in table 4, the minimum expected downtime calculated by equation (22) is listed for each state $s_i$ in the row "Min. of $E(\rho)$". As can be seen, the minimum expected downtime with $k^* = 1$ appears in states 1, 2, 3, and 4. State 4 has the largest health state value, so the threshold at time $t$ = 51 is set at state 4 by equation (23). Other threshold states at different time points can be obtained in the same way. The threshold states are listed in the last column of table 4. Figure 15 presents the PM alarm boundary formed by the threshold states over the time horizon from 0 to 150, where the states have been translated back to the corresponding value of health index (refer to table 2).

[Figure 14: health index vs. equipment’s run time (hr.), showing the alarm boundary, the trend of the health index, and the point at which a PM is called for.]

Table 4. Expected downtime fraction (%) and PM threshold states.

| Time | $k$ | $s_{10}$ = 10 | $s_9$ = 9 | $s_8$ = 8 | $s_7$ = 7 | $s_6$ = 6 | $s_5$ = 5 | $s_4$ = 4 | $s_3$ = 3 | $s_2$ = 2 | $s_1$ = 1 | Threshold state $s^*(m)$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $t$ = 51 | 1 | 9.579 | 7.719 | 7.319 | 4.149 | 1.428 | 1.273 | 1.204 | 1.162 | 1.137 | 1.121 | |
| | 2 | 2.308 | 1.540 | 1.356 | 1.237 | 1.243 | 1.256 | 1.279 | 1.317 | 1.373 | 1.454 | |
| | 3 | 1.349 | 1.274 | 1.255 | 1.272 | 1.342 | 1.399 | 1.467 | 1.570 | 1.702 | 1.864 | |
| | 4 | 1.247 | 1.268 | 1.294 | 1.358 | 1.453 | 1.536 | 1.632 | 1.756 | 1.887 | 2.015 | |
| | 5 | 1.243 | 1.298 | 1.350 | 1.433 | 1.523 | 1.597 | 1.677 | 1.762 | 1.833 | 1.888 | |
| | 6 | 1.257 | 1.322 | 1.381 | 1.457 | 1.516 | 1.559 | 1.601 | 1.636 | 1.657 | 1.669 | |
| | Min. of $E(\rho)$ | 1.243 | 1.268 | 1.255 | 1.237 | 1.243 | 1.256 | 1.204 | 1.162 | 1.137 | 1.121 | 4 |
| $t$ = 52 | 1 | 9.568 | 7.565 | 7.149 | 3.913 | 1.391 | 1.260 | 1.197 | 1.158 | 1.135 | 1.120 | |
| | 2 | 2.236 | 1.494 | 1.333 | 1.227 | 1.243 | 1.257 | 1.280 | 1.319 | 1.376 | 1.457 | |
| | 3 | 1.329 | 1.267 | 1.251 | 1.272 | 1.347 | 1.406 | 1.475 | 1.582 | 1.716 | 1.881 | |
| | 4 | 1.242 | 1.269 | 1.298 | 1.366 | 1.464 | 1.549 | 1.648 | 1.774 | 1.905 | 2.033 | |
| | 5 | 1.245 | 1.305 | 1.360 | 1.447 | 1.538 | 1.612 | 1.693 | 1.776 | 1.845 | 1.897 | |
| | 6 | 1.264 | 1.334 | 1.396 | 1.473 | 1.529 | 1.571 | 1.611 | 1.643 | 1.663 | 1.673 | |
| | Min. of $E(\rho)$ | 1.242 | 1.267 | 1.251 | 1.227 | 1.243 | 1.257 | 1.197 | 1.158 | 1.135 | 1.120 | 4 |
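The threshold states in table 4 can be re-derived from the tabulated downtime fractions with a few lines of code. The sketch below copies the table values, applies equations (22) and (23) column by column, and assumes that only the six $k$ values printed in the table are candidates; the names `TABLE_4`, `k_star` and `threshold_state` are illustrative.

```python
STATES = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]   # column order of table 4

TABLE_4 = {  # time t -> {k: expected downtime fraction (%) for states 10, 9, ..., 1}
    51: {1: [9.579, 7.719, 7.319, 4.149, 1.428, 1.273, 1.204, 1.162, 1.137, 1.121],
         2: [2.308, 1.540, 1.356, 1.237, 1.243, 1.256, 1.279, 1.317, 1.373, 1.454],
         3: [1.349, 1.274, 1.255, 1.272, 1.342, 1.399, 1.467, 1.570, 1.702, 1.864],
         4: [1.247, 1.268, 1.294, 1.358, 1.453, 1.536, 1.632, 1.756, 1.887, 2.015],
         5: [1.243, 1.298, 1.350, 1.433, 1.523, 1.597, 1.677, 1.762, 1.833, 1.888],
         6: [1.257, 1.322, 1.381, 1.457, 1.516, 1.559, 1.601, 1.636, 1.657, 1.669]},
    52: {1: [9.568, 7.565, 7.149, 3.913, 1.391, 1.260, 1.197, 1.158, 1.135, 1.120],
         2: [2.236, 1.494, 1.333, 1.227, 1.243, 1.257, 1.280, 1.319, 1.376, 1.457],
         3: [1.329, 1.267, 1.251, 1.272, 1.347, 1.406, 1.475, 1.582, 1.716, 1.881],
         4: [1.242, 1.269, 1.298, 1.366, 1.464, 1.549, 1.648, 1.774, 1.905, 2.033],
         5: [1.245, 1.305, 1.360, 1.447, 1.538, 1.612, 1.693, 1.776, 1.845, 1.897],
         6: [1.264, 1.334, 1.396, 1.473, 1.529, 1.571, 1.611, 1.643, 1.663, 1.673]},
}

def k_star(rows, column):
    """Equation (22): the k with the smallest expected downtime for one state column."""
    return min(rows, key=lambda k: rows[k][column])

def threshold_state(rows):
    """Equation (23): healthiest state whose optimal k equals 1 (None if there is none)."""
    alarm_states = [s for column, s in enumerate(STATES) if k_star(rows, column) == 1]
    return max(alarm_states) if alarm_states else None

boundary = {t: threshold_state(rows) for t, rows in TABLE_4.items()}
print(boundary)   # {51: 4, 52: 4}
```

Repeating the same calculation for every time point yields the sequence of threshold states that forms the alarm boundary.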

We can see that the PM alarm boundary in figure 15 increases with the equipment’s run time. This increasing boundary results from the increasing probability that the equipment’s condition becomes worse. When the failure probability grows with time, the expected downtime fraction due to failure repair will increase. Therefore, the rate at which the PM alarm boundary increases is greatly affected by the relation between the PM downtime and the failure repair downtime. That is, the boundary will increase rapidly if the failure repair downtime is much longer than the PM downtime (i.e. $D$ is large compared to $R$). On the other hand, the boundary will grow slowly with time if the additional downtime due to failure repair is relatively small. Figure 16 shows the PM alarm boundaries with different values of ($R$, $D$). The dotted line represents the boundary with a large $D$ and the solid line represents the boundary with a relatively small $D$. As shown, a smaller $D$ will result in a lower boundary.

Figure 15. PM alarm boundary (example).

Figure 16. Comparison of PM alarm boundaries with different $R$ and $D$ (example): health index vs. time for $R$ = 10, $D$ = 30 and for $R$ = 10, $D$ = 5.

5. Concluding remarks

In this paper, we have demonstrated a procedure to evaluate and prognosticate the equipment health. Although the proposed health index is rather ad hoc and more rigorous comparisons and discussions are needed, it serves as a starting point for utilizing the vast equipment sensor data to gain information on the equipment health. This paper has also presented a novel condition-based PM policy. To distinguish our approach from the conventional condition-based maintenance (CBM) and PM policies under Markovian or semi-Markovian deterioration model in the literature, we have named our approach a dynamic PM policy under aging Markovian deterioration. In CBM, the time-based maintenance, with the failure rate directly impacted by a hazard rate function, is extended to incorporate the equipment condition as a factor precipitating the hazard rate. The dynamic PM policy, in contrast, is rooted in a stochastic health model and the age is modeled as a factor hastening the equipment health deterioration. To ensure the applicability of the proposed approach, model estimation is also developed and the procedure is demonstrated through examples. In addition, the PM policies considered in this paper are only full-scale PM policies. Optimization models for other types of PM policies, such as minimum repairs, can be studied in the future research.

Acknowledgements

The authors would like to thank Professor Gertsbakh and two anonymous referees for their valuable suggestions. Professor R.-S. Guo’s skillful project management has helped the smooth completion of this research. This research is funded by NSC88-2212-E-002-063 project.

References

Banjevic, D., Jardine, A.K.S., Makis, V. and Ennis, E., A control-limit policy and software for condition-based maintenance optimization. INFOR, 2001, 39, 32–50.

Barlow, R.E. and Hunter, L., Optimum preventive maintenance policies. Oper. Res., 1960, 8, 90–100.

Barlow, R.E. and Proschan, F., Mathematical Theory of Reliability, 1965 (John Wiley: New York).

Blanks, H.S. and Tordan, M.J., Optimum replacement of deteriorating and inadequate equipment. Qual. Reliab. Engng Int., 1986, 2, 183–197.

Chen, A., Guo, R.-S., Yang, A. and Tseng, C.-L., An integrated approach to semiconductor equipment monitoring. J. Chin. Soc. Mech. Enging, 1998, 19(6), 581–591.

Chen, A. and Tsai, C.C., Accommodating engineering knowledge in T² control chart construction for equipment FDC, in Proceedings International Symposium on Semiconductor Manufacturing, 2004, pp. 281–285 (Semiconductor Portal: Tokyo).


Cox, D.R., Partial likelihood. Biometrika, 1975, 62, 269–276.

Cox, D.R. and Oakes, D., Analysis of Survival Data, 1984 (Chapman and Hall: London).

Dekker, R., Applications of maintenance optimization models: a review and analysis. Reliab. Engng Syst. Safe., 1996, 51, 229–240.

Elsayed, E.A., Reliability Engineering, pp. 527–540, 1996 (Addison Wesley: New York).

Gertsbakh, I.B., Models of Preventive Maintenance, pp. 229–244, 1977 (North-Holland: Amsterdam).

Hogg, R.V. and Tanis, E.A., Probability and Statistical Inference, 4th ed., 1996 (Prentice Hall: Upper Saddle River, NJ).

Jardine, A.K.S., Maintenance, Replacement, and Reliability, 1973 (Pitman/Wiley: New York). Reprinted edition available from International Academic Services, PO Box 2, Kingston, Ontario, Canada, K7L4V6.

Jardine, A.K.S., Banjevic, D. and Makis, V., Optimal replacement policy and the structure of software for condition-based maintenance. J. Qual. Maint. Engng, 1997, 3(2), 109–119.

Jardine, A.K.S., Banjevic, D., Wiseman, M., Buck, S. and Joseph, T., Optimizing a mine haul truck wheel motors’ condition monitoring program: use of proportional hazards modeling. J. Qual. Maint. Engng, 2001, 7(4), 286–300.

Kao, E.P.C., Optimal replacement rules when changes of state are semi-Markovian. Oper. Res., 1973, 21, 1231–1249.

Kaspi, M. and Shabtay, D., Optimization of the machine economics problem for a multistage transfer machine under failure, opportunistic and integrated replacement strategies. Int. J. Prod. Res., 2003, 41(10), 2229–2247.

Kotz, S. and Johnson, N.L., Process Capability Indices, pp. 37–115, 1993 (Chapman & Hall: London).

Kumar, D. and Klefsjo, B., Proportional hazards model: a review. Reliab. Engng Syst. Safe., 1994, 44, 177–188.

Leachman, R.C., Closed-loop measurement of equipment efficiency and equipment capacity. IEEE Trans. Semicond. Manuf., 1997, 10(1), 84–97.

Lin, D., Wiseman, M., Banjevic, D. and Jardine, A.K.S., An approach to signal processing and condition-based maintenance for gearboxes subject to tooth failure. Mech. Syst. Signal Proc., 2004, 18, 993–1007.

Makis, V. and Jardine, A.K.S., Optimal replacement in the proportional hazards model. INFOR, 1991, 30(1), 172–183.

Makis, V. and Jardine, A.K.S., Computation of optimal policies in replacement models. J. Math. Appl. Bus. Indust., 1992, 3, 169–175.

McCall, J.J., Maintenance policies for stochastically failing equipment: a survey. Manag. Sci., 1965, 11, 493–524.

Milioni, A.Z. and Pliska, S.R., Optimal inspection under semi-Markovian deterioration: basic results. Nav. Res. Log., 1988, 35, 373–392.

O’Sullivan, P., Martinzez, J., Durham, J. and Felker, S., Advanced fault detection for the semiconductor industry. Future Fab Int., 1996, 1, 71–73.

Ross, S.M., Introduction to Probability Models, 7th ed., 2000 (Academic Press: San Diego).

Sherif, Y.S. and Smith, M.L., Optimal maintenance models for systems subject to failure: a review. Nav. Res. Log. Quar., 1981, 28, 47–74.

Stamatis, A., Mathioudakis, K. and Papailiou, K., Optimal measurement and health index selection for gas turbine performance status and fault diagnosis. J. Engng Gas Turb. Power, 1992, 114, 209–216.

Taam, W., Subbaiah, P. and Liddy, J.W., A note on multivariate capability indices. Working paper, Department of Mathematical Sciences, Oakland University, Rochester, MI, 1991.

Takahasi, Y., and Takashi, O., TPM: total productive maintenance. Asian Prod. Ass., Tokyo, Japan, 1990.

Valdez-Flores, C. and Feldman, R.M., A survey of preventive maintenance models for stochastically deteriorating single-unit systems. Nav. Res. Log., 1989, 36, 419–446.

Yeh, R.H., Optimal inspection and replacement policies for multi-state deteriorating systems.