Proceedings of MIMAR 2007
Modelling in Industrial Maintenance and Reliability
Proceedings of MIMAR 2007, the 6th IMA International Conference
10-11 September 2007, The Lowry Centre, Salford Quays,
Manchester, United Kingdom
Edited by
Matthew Carr
Philip Scarf
Wenbin Wang
Organised jointly by the Centre for Operational Research and Applied Statistics,
Salford Business School, University of Salford
and by the Institute of Mathematics and its Applications
United Kingdom
Preface
These proceedings present a collected volume of papers submitted to the 6th Institute of Mathematics and its Applications (IMA) international conference on Modelling in Industrial Maintenance and Reliability (MIMAR), held on the 10th and 11th September 2007 at the Lowry Centre in Salford Quays, Manchester. The conference was jointly organised by the IMA and the Centre for Operational Research and Applied Statistics (CORAS) at the University of Salford. The MIMAR conferences follow a three-year cycle; previous conferences were held at the University of Edinburgh (1992, 1998) and the University of Salford (1995, 2001, 2004).
The aim of the conference is to provide an overview of current research in industrial maintenance and reliability modelling and to promote discussion of areas for future research. Topics for presentation include: life cycle analysis and maintenance strategies; inspections and replacements; condition monitoring and condition-based maintenance; and warranty analysis and logistics.
The conference brings together researchers from the UK, China, the Czech Republic, Brazil, Canada, Japan, Kuwait, Sweden, Finland and elsewhere. All delegates were asked to produce a short paper, according to a particular format, describing the work that they intended to present at the conference. It is these papers that are collected together in this volume. Constructing the papers to a specific format assists the editing process and gives the proceedings a consistent style and presentation. The short papers enable authors to: (i) publish research; (ii) provide the interested listener with supporting material during the course of presentations; and (iii) supply additional material that may be difficult to present, complementing the presentation and providing a point of reference for further discussion.
Where possible, the papers are organised according to the scheduled order of presentation in the conference programme. When papers have not been supplied, due to restrictions on time or the required format, abstracts only have been included in these proceedings. In addition, authors have been invited to submit extended versions of the papers presented here for publication in a special issue of the IMA Journal of Management Mathematics. Papers submitted for the special issue will be subject to a more rigorous refereeing process, consistent with standard scientific journals. Details are available on the IMA web-site and papers are due by 30th September 2007.
MATTHEW CARR
PHILIP SCARF
WENBIN WANG
August 2007.
Contents
(Organised according to the order of presentation at the conference)
Modelling different failure modes in CBM applications using a weighted combination of stochastic filters
M J Carr and W Wang (p. 1)

Remaining useful life in condition based maintenance: Is it useful?
D Banjevic and A K S Jardine (p. 7)

Demand categorisation in a European spare parts logistics network (Abstract only)
A A Syntetos, M Z Babai and M Keyes (p. 13)

How academics can help industry and the other way around
R Dwight, H Martin and J Sharp (p. 14)

Stochastic modelling maintenance actions of complex systems
J R Kearney, D F Percy and K A H Kobbacy (p. 18)

Application of the delay-time concept in a manufacturing industry
B Jones, I Jenkinson and J Wang (p. 23)

A preventive maintenance decision model based on a MCDA approach
R J P Ferreira, C A V Cavalcante and A T de Almeida (p. 29)

Predicting the performance of future generations of complex repairable systems, through analysis of reliability and maintenance data
T J Jefferis, N Montgomery and T Dowd (p. 35)

A note on a calibration method of control parameters in simulation models for software reliability assessment
M Kimura (p. 40)

Estimating the availability of a reverse osmosis plant
M Hajeeh, F Faramarzi and G Al-Essa (p. 45)

The use of IT within maintenance management for continuous improvement
A Ingwald and M Kans (p. 51)

An approximate algorithm for condition-based maintenance applications
M J Carr and W Wang (p. 57)

The utility of a maintenance policy
R D Baker (p. 63)

Stochastic demand patterns for Markov service facilities with neutral and active periods
A Csenki (p. 68)

Multicriteria decision model for selecting maintenance contracts by applying utility theory and variable interdependent parameters
A J de Melo Brito and A T de Almeida (p. 74)

Spare parts planning and risk assessment associated with non-considering system operating environment
B Ghodrati (p. 80)

Modern maintenance system based on web and mobile technologies
J Campos, E Jantunen and O Prakash (p. 91)

A literature review of computerised maintenance management support
M Kans (p. 96)

Some generalizations of age and block replacement (Abstract only)
P Scarf (p. 102)

Scheduling the imperfect preventive maintenance policy for deteriorating systems
Y Jia and X Chen (p. 103)

Contribution to modelling of dynamic dependability of complex systems
D Valis (p. 110)

Weak sound pulse extraction in pipe leak inspection using stochastic resonance
Y Dingxin and X Yongcheng (p. 116)

An empirical comparison of periodic stock control heuristics for intermittent demand items (Abstract only)
M Z Babai and A A Syntetos (p. 121)

Minimising average long-run cost for systems monitored by the np control chart
S Wu and W Wang (p. 122)

Condition evaluation of equipment in power plant based on grey theory
J Li and S Huang (p. 134)
Modelling different failure modes in CBM applications using a weighted
combination of stochastic filters
Matthew J. Carr, Wenbin Wang
CORAS, University of Salford, UK.
[email protected], [email protected]
Abstract: In the context of condition-based maintenance (CBM), probabilistic stochastic filters provide an
established means of recursively estimating the residual life of an individual component using condition
monitoring (CM) information. In this paper, we consider the potential for modelling the impact of multiple
failure modes that exhibit specific types of behaviour and are identifiable using historical data. The behaviour
may be categorised according to the pattern of the observed CM information, the failure times of components,
or both. Stochastic filters are constructed for each contingency and we develop a Bayesian model that is used to
recursively evaluate the probability that the observed CM information corresponds to each of the failure modes.
The output from the individual filters is then weighted accordingly. Two scenarios are considered: the first involves a fixed but unknown underlying failure mode, and the second caters for transitions between the failure modes over time. An example is presented using simulated data to illustrate the applicability of the
methodology.
1. Introduction
CBM applications utilise CM information when scheduling maintenance and replacement activities for
individual components. The components degrade stochastically over time under operational conditions and
CBM models are used to reduce the occurrence of costly and untimely failures. A number of relevant models
exist in the literature including proportional hazards models (Makis & Jardine (1991) and Vlok et al. (2002)),
accelerated failure-time models (Cox & Oakes (1984)) and stochastic filters (Wang & Christer (2000) and Wang
(2002)). The models are parameterised using data sets consisting of CM histories pertaining to analogous
components. For the research documented in this paper, we are considering CBM scenarios where multiple
failure modes exhibit themselves and are identifiable using historical CM data. We assume that the different
failure modes display behaviour that can be categorised according to the CM output, the failure times of the
components or both.
The methodology proposed in this paper involves constructing a stochastic filter for each of the defined
failure modes. A given filter is used to recursively establish a conditional density for the residual life at each
CM point under the relevant failure mode. An estimate of the residual life of a component is then defined as a
weighted combination of the respective output from the individual filters. Firstly, we consider a situation where the underlying dynamics (or failure mode) are assumed to be fixed and to conform to one of the proposed models for the case. This model is for use when the behaviour can correspond to a number of distinct behavioural types and we simply do not know which type the current component conforms to. An example considered in section 5
involves the modelling and estimation of the residual life of a component when the behaviour can correspond to
one of two potential failure modes. The behaviour is assumed to manifest itself in the form of failure time
clustering as demonstrated in figure 1. Separate models are established for each scenario and a recursive
procedure is developed to determine, during the life of a component, which model the underlying dynamics
conform to using both the age and the available CM history. We also consider the potential for the dynamics to
evolve or fluctuate during the life of a component. We assume that at any given stage, the underlying failure
mode conforms to one or more of the proposed models and that unknown transitions between the individual
failure modes occur over time. The transition probabilities must be estimated from available data and are
modelled using a Markov chain.
Figure 1. Illustrating the clustering of failure times when two different failure modes exist
2. An individual stochastic filter
In this section, we describe a stochastic filter designed to facilitate residual life prediction for a component under the jth individual failure mode (j = 1, 2, …, r). Let xi denote the underlying residual life of a component at the ith CM point at time ti. In addition, we observe a vector of CM parameters yi and have available the filtration, or CM history, ℑi = {y0, y1, …, yi}. From Wang (2002), the posterior conditional probability density for the residual life at the ith monitoring point, given the history of CM information for a component, is given by
$$p_{ji}(x_i \mid M_j, \Im_i) = p(x_i \mid M_j, y_i, \Im_{i-1}) = \frac{p(y_i \mid x_i, M_j, \Im_{i-1})\, p(x_i \mid M_j, \Im_{i-1})}{p(y_i \mid M_j, \Im_{i-1})} \qquad (1)$$
where Mj represents the jth failure mode, and yi and ℑi-1 are independent given xi, so that we have
$$p(y_i \mid x_i, M_j, \Im_{i-1}) \equiv p(y_i \mid x_i, M_j) \qquad (2)$$
The influence of the individual failure mode is reflected in the specification of the component parts of the relevant filter, i.e. a density for the initial residual life pj0(x0 | Mj, ℑ0) = pj0(x0 | Mj) and the density given by equation (2) representing the stochastic relationship between the monitored information and the underlying residual life. The second element of the numerator of equation (1) is derived as an updated version of the residual life distribution from the previous recursion, at time ti-1, as
$$p(x_i \mid M_j, \Im_{i-1}) = \frac{p_{j,i-1}(x_{i-1} = x_i + t_i - t_{i-1} \mid M_j, \Im_{i-1})}{\int_{t_i - t_{i-1}}^{\infty} p_{j,i-1}(u \mid M_j, \Im_{i-1})\, du} \qquad (3)$$
The denominator of equation (1) is established as
$$p(y_i \mid M_j, \Im_{i-1}) = \int_0^{\infty} p(y_i \mid x_i, M_j, \Im_{i-1})\, p(x_i \mid M_j, \Im_{i-1})\, dx_i \qquad (4)$$
Using historical CM information and failure time data, the parameters of the individual stochastic filters are established for W relevant histories using the likelihood function
$$L(\theta) = \prod_{d=1}^{W} \left( \prod_{i=1}^{n_d} p(y_{di} \mid M_j, \Im_{d,i-1})\, P_{jd,i-1}(x_{d,i-1} > t_{di} - t_{d,i-1} \mid M_j, \Im_{d,i-1}) \right) p_{jdn_d}(x_{dn_d} = T_d - t_{dn_d} \mid M_j, \Im_{dn_d}) \qquad (5)$$
where a lower-case p represents a density function, an upper-case P represents a probability, θ is the set of unknown parameters, and nd is the number of monitoring points in the dth CM history (d = 1, …, W).
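To make the recursion concrete, here is a minimal numerical sketch of equations (1)-(4) on a discretised residual-life grid; the function name, the grid representation and the generic observation density are illustrative assumptions of ours, not the authors' implementation.

```python
# Minimal sketch of one recursion of equations (1)-(4) for a single failure
# mode, with the residual-life density held on a uniform grid.
import numpy as np

def filter_update(prior, grid, y, dt, obs_pdf):
    """prior   : p(x_{i-1} | M_j, CM history) evaluated on `grid`
    grid    : uniform grid of residual-life values x
    y       : new CM observation y_i
    dt      : elapsed time t_i - t_{i-1}
    obs_pdf : function (y, x) -> p(y | x, M_j), equation (2)
    Returns the posterior of equation (1) and the evidence of equation (4)."""
    dx = grid[1] - grid[0]
    # Equation (3): condition the previous posterior on survival over dt.
    shifted = np.interp(grid + dt, grid, prior, right=0.0)
    shifted /= max(shifted.sum() * dx, 1e-300)
    # Numerator of equation (1): likelihood times the one-step prediction.
    post = obs_pdf(y, grid) * shifted
    # Equation (4): the normalising constant p(y_i | M_j, CM history).
    evidence = post.sum() * dx
    return post / max(evidence, 1e-300), evidence
```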
3. Fixed failure mode
In this section, we discuss the weighted modelling approach for an individual component with a fixed
underlying failure mode. The problem is essentially one of competing risks (Crowder (2001)) and we construct
r different stochastic filters, each pertaining to an individual and distinct failure mode. The notation Mj
represents failure mode j ( j = 1, 2, …, r) and with the availability of a set of past CM histories, the jth stochastic
filter is parameterised using only those CM histories that correspond to failure mode j. The prior probability
that the underlying dynamics of the CM process for a given component correspond to failure mode j is denoted
as p(Mj|ℑ0) and is easily estimated from historical data as
p(Mj|ℑ0) = (Number of histories relevant to failure mode j) / (Total number of histories)
Considering a vector of condition monitoring parameters, yi, obtained at the ith discrete monitoring point at
time ti, we have
$$p(M_j \mid \Im_i) = p(M_j \mid y_i, \Im_{i-1}) \qquad (6)$$
as the conditional probability that the underlying dynamics of the current CM process correspond to failure mode j given the CM history available until that point in time. By the application of Bayes' law we obtain
$$p(M_j \mid y_i, \Im_{i-1}) = \frac{p(y_i \mid M_j, \Im_{i-1})\, p(M_j \mid \Im_{i-1})}{p(y_i \mid \Im_{i-1})} \qquad (7)$$
where the initial probability p(Mj | ℑ0) is assumed to be known, and p(Mj | ℑi-1) represents the probability that the underlying dynamics conform to failure mode j, available from the previous recursion of the process. This is the means by which our best judgement regarding the actual underlying failure mode, and hence the residual life of the unit, is updated at each monitoring point. We also have
$$p(y_i \mid M_j, \Im_{i-1}) = \int_0^{\infty} p(y_i \mid x_i, M_j)\, p(x_i \mid M_j, \Im_{i-1})\, dx_i \qquad (8)$$
on the assumption that p(yi | xi, Mj, ℑi-1) = p(yi | xi, Mj), i.e. yi is controlled by xi and Mj only. The denominator of equation (7) is obtained by enumerating over all the possible scenarios as
$$p(y_i \mid \Im_{i-1}) = \sum_{j=1}^{r} p(y_i \mid M_j, \Im_{i-1})\, p(M_j \mid \Im_{i-1}) \qquad (9)$$
A weighted mean estimate of the residual life can be obtained as
$$E[x_i \mid \Im_i] = \int_0^{\infty} x_i\, p_i(x_i \mid \Im_i)\, dx_i \qquad (10)$$
where the weighted conditional distribution is simply
$$p_i(x_i \mid \Im_i) = \sum_{j=1}^{r} p_{ji}(x_i \mid M_j, \Im_i)\, p(M_j \mid \Im_i) \qquad (11)$$
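The weighting scheme of equations (6)-(11) sits directly on top of the per-mode filters. A sketch, reusing the illustrative filter_update above (again an assumption of ours, not the authors' code):

```python
# Sketch of the fixed-mode weighting of equations (6)-(11). Each mode j
# supplies its own prior density and observation model.
import numpy as np

def weighted_update(mode_probs, priors, grid, y, dt, obs_pdfs):
    dx = grid[1] - grid[0]
    posts, evid = [], []
    for prior, obs_pdf in zip(priors, obs_pdfs):
        post, ev = filter_update(prior, grid, y, dt, obs_pdf)  # eqs (1)-(4)
        posts.append(post)
        evid.append(ev)
    # Equations (7) and (9): Bayes update of the mode probabilities.
    w = np.asarray(evid) * np.asarray(mode_probs)
    w = w / w.sum()
    # Equation (11): mixture density; equation (10): weighted mean RUL.
    mixture = sum(wj * pj for wj, pj in zip(w, posts))
    mean_rul = (grid * mixture).sum() * dx
    return w, mixture, mean_rul
```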
4. Failure mode transitions
To facilitate the modelling of underlying dynamics that can potentially vary over time as a component ages, we
introduce the notation Mji as being representative of the underlying dynamics conforming to failure mode j at the
ith monitoring point. A time-invariant Markov chain is established with transition probabilities
$$a_{kj} = p(M_{ji} \mid M_{k,i-1}) \qquad (12)$$
that correspond to the conditional probability that the underlying dynamics currently conform to failure mode j
at the ith monitoring point given that they conformed to failure mode k at the previous monitoring point. The
objective of the combined modelling approach with evolving dynamics is to establish the conditional
distribution
$$p_i(x_i \mid \Im_i) = \sum_{j=1}^{r} p(x_i \mid M_{ji}, \Im_i)\, p(M_{ji} \mid \Im_i) \qquad (13)$$
Both terms in equation (13) require some explanation. The first is established as
$$p(x_i \mid M_{ji}, \Im_i) = p(x_i \mid M_{ji}, y_i, \Im_{i-1}) = \frac{p(y_i \mid x_i, M_{ji})\, p(x_i \mid M_{ji}, \Im_{i-1})}{p(y_i \mid M_{ji}, \Im_{i-1})} \qquad (14)$$
where the relationship p(yi | xi, Mji) is available from the model specification and we have
$$p(x_i \mid M_{ji}, \Im_{i-1}) = \sum_{k=1}^{r} p(x_i \mid M_{ji}, M_{k,i-1}, \Im_{i-1})\, p(M_{k,i-1} \mid M_{ji}, \Im_{i-1}) \qquad (15)$$
In this context, p(xi | Mji, Mk,i-1, ℑi-1) = p(xi | Mk,i-1, ℑi-1), as the one-step prediction of xi is available from the previous recursion and does not depend on the current model, there being no reliance on yi. We also have the reverse transition expression
$$p(M_{k,i-1} \mid M_{ji}, \Im_{i-1}) = \frac{a_{kj}\, p(M_{k,i-1} \mid \Im_{i-1})}{\sum_{k'=1}^{r} a_{k'j}\, p(M_{k',i-1} \mid \Im_{i-1})} \qquad (16)$$
and the denominator of equation (14) is established as
$$p(y_i \mid M_{ji}, \Im_{i-1}) = \int_0^{\infty} p(y_i \mid x_i, M_{ji})\, p(x_i \mid M_{ji}, \Im_{i-1})\, dx_i \qquad (17)$$
Now we consider the second term of equation (13). Assuming the initial probability that the underlying dynamics at the start of the CM process for a new component correspond to failure mode j, p(Mj | ℑ0), is known, we again employ Bayes' theorem to recursively obtain
$$p(M_{ji} \mid \Im_i) = p(M_{ji} \mid y_i, \Im_{i-1}) = \frac{p(y_i \mid M_{ji}, \Im_{i-1})\, p(M_{ji} \mid \Im_{i-1})}{p(y_i \mid \Im_{i-1})} \qquad (18)$$
where the constituent elements of the numerator are
$$p(y_i \mid M_{ji}, \Im_{i-1}) = \int_0^{\infty} p(y_i \mid x_i, M_{ji})\, p(x_i \mid M_{ji}, \Im_{i-1})\, dx_i \qquad (19)$$
$$p(M_{ji} \mid \Im_{i-1}) = \sum_{k=1}^{r} a_{kj}\, p(M_{k,i-1} \mid \Im_{i-1}) \qquad (20)$$
and the denominator is given by enumerating over the predictions available from all the potential models as
$$p(y_i \mid \Im_{i-1}) = \sum_{j=1}^{r} p(y_i \mid \Im_{i-1}, M_{ji})\, p(M_{ji} \mid \Im_{i-1}) \qquad (21)$$
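Relative to section 3, the only new machinery is the propagation of the mode probabilities through the Markov chain. A minimal sketch of equations (16) and (20), where A is an assumed r × r matrix holding the a_kj of equation (12):

```python
# Sketch of the Markov mode-transition steps of section 4. A[k, j] holds the
# transition probability a_kj; mode_probs holds p(M_{k,i-1} | CM history).
import numpy as np

def propagate_modes(A, mode_probs):
    """Equation (20): p(M_ji | F_{i-1}) = sum_k a_kj p(M_{k,i-1} | F_{i-1})."""
    return np.asarray(mode_probs) @ np.asarray(A)

def reverse_transition(A, mode_probs, j):
    """Equation (16): p(M_{k,i-1} | M_ji, F_{i-1}) for the current mode j."""
    w = np.asarray(A)[:, j] * np.asarray(mode_probs)
    return w / w.sum()
```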
5. Example – fixed dynamics
In this example, we consider the modelling and estimation of the residual life of a component using vibration
information when two potential failure modes are assumed to have been identified from relevant data in a
scenario similar to that illustrated in figure 1. When the monitoring process commences for a new component,
the underlying dynamics are assumed fixed but unknown, as described in section 3, and we develop two
separate stochastic filters (filter 1 and filter 2) to represent each potential eventuality. The filters are developed
using the same functional forms but are parameterised independently using relevant analogous component
histories. The filters are then conducted in parallel and their respective output weighted according to the
probability that the underlying dynamics correspond to the relevant failure mode. In this example, we simulate
a cycle of data in accordance with each modelling formulation and investigate the ability of the prescribed
methodology to track the appropriate underlying failure mode and the residual life of the component. The
estimate of the residual life at each monitoring point is then compared with the estimate from a general
stochastic filter (filter 3) that is developed and parameterised using all the available monitoring information, i.e.
the histories are not classified according to any failure mode and are all grouped together for parameter
estimation purposes. This is achieved by simulating a large number of cycles of CM data corresponding to
failure modes 1 and 2 and parameterising a general stochastic filter (filter 3) using all the simulated output. We
then compare the weighted output from filters 1 and 2 with the output from filter 3 to ascertain the benefit of the
combined modelling approach for this particular scenario.
From equations (1)-(4), the filtering expression for filter j is
$$p_{ji}(x_i \mid \Im_i, M_j) = \frac{p(y_i \mid x_i, M_j)\, p_{j,i-1}(x_i + t_i - t_{i-1} \mid \Im_{i-1}, M_j)}{\int_0^{\infty} p(y_i \mid x_i, M_j)\, p_{j,i-1}(x_i + t_i - t_{i-1} \mid \Im_{i-1}, M_j)\, dx_i} \qquad (22)$$
for j = 1, 2, 3. The constituent elements of filter j are the initial residual life distribution
$$p_{j0}(x_0 \mid M_j) = \frac{\alpha_j}{\Gamma(\beta_j)}\, (\alpha_j x_0)^{\beta_j - 1} e^{-\alpha_j x_0} \qquad (23)$$
which is defined as a Gamma distribution for each model but parameterised independently. Similarly, the distribution governing the conditional relationship between the observed vibration reading and the underlying residual life is taken to be Gaussian for all the filters:
$$p(y_i \mid x_i, M_j) = \frac{1}{\sigma_{ji}\sqrt{2\pi}} \exp\left\{ -\frac{1}{2}\left( \frac{y_i - \mu_{ji}}{\sigma_{ji}} \right)^2 \right\} \qquad (24)$$
where, for filter j, μji = Aj + Bj exp{-Cj xi} is the expected vibration level at the ith monitoring point given a particular realisation of the underlying residual life. We assume that the standard deviation parameter is proportional to the vibration level, σji = dj yi. The specified and estimated parameters of filters 1 and 2 are given in table 1.
Parameter | Filter 1 (specified) | Filter 1 (estimated) | Filter 2 (specified) | Filter 2 (estimated)
A | 5 | 5.11 | 5 | 4.981
B | 17.5 | 17.3 | 21 | 20.07
C | 0.025 | 0.027 | 0.01 | 0.011
d | 0.12 | 0.126 | 0.15 | 0.141
α | 0.2 | 0.218 | 0.1 | 0.115
β | 45 | 44.205 | 75 | 74.504

Table 1. The specified and estimated parameters of filters 1 and 2
The expected CM paths for the average life corresponding to model formulations 1 and 2 are illustrated in
figure 2. The general filter (filter 3) is constructed with the same forms as filters 1 and 2, given by equations (23)
and (24), and the parameters are estimated using 100 simulated histories. 50 of the histories are generated
according to failure mode 1 and 50 according to mode 2. The reasoning for this is that, for simplicity and to
demonstrate the methodology, we develop a scenario in which both contingencies are equally likely, i.e. before
the monitoring process begins, we have the initial probabilities p(M1) = p(M2) = 0.5.
Figure 2. Illustrating the expected CM paths for failure modes 1 and 2
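The expected paths of figure 2 can be reproduced from the specified parameters in table 1: under equation (24) the expected vibration level at residual life x is μ = A + B exp{-Cx}, so substituting x = T - t for a life of length T traces the expected CM path. A small illustrative sketch under these assumptions, taking T as the Gamma mean β/α:

```python
# Sketch reproducing the expected CM paths of figure 2 from the specified
# parameters of table 1; the grid of CM times is an illustrative assumption.
import numpy as np

params = {  # A, B, C, alpha, beta from table 1 (specified values)
    "mode 1": dict(A=5.0, B=17.5, C=0.025, alpha=0.2, beta=45.0),
    "mode 2": dict(A=5.0, B=21.0, C=0.01, alpha=0.1, beta=75.0),
}
for name, p in params.items():
    T = p["beta"] / p["alpha"]               # mean life of the Gamma prior
    t = np.linspace(0.0, T, 6)               # illustrative CM times
    mu = p["A"] + p["B"] * np.exp(-p["C"] * (T - t))   # expected vibration
    print(name, "mean life", T, "path", np.round(mu, 2))
```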
Using the CM histories simulated according to failure modes 1 and 2, the estimated parameters of the
general stochastic filter (filter 3) are given in table 2.
Parameter | General model (filter 3)
A | 5.482
B | 17.702
C | 0.02
d | 0.195
α | 0.00778
β | 3.266

Table 2. The estimated parameters of the general stochastic filter (filter 3)
At the ith CM point, a closed-form stochastic filtering expression is available for filter j as
$$p_{ji}(x_i \mid \Im_i, M_j) = \frac{(x_i + t_i)^{\beta_j - 1} \exp\left\{ -\alpha_j (x_i + t_i) - \sum_{k=1}^{i} \phi_k(x_i, t_i) \right\}}{\int_0^{\infty} (u + t_i)^{\beta_j - 1} \exp\left\{ -\alpha_j (u + t_i) - \sum_{k=1}^{i} \phi_k(u, t_i) \right\} du} \qquad (25)$$
for which we define the function
$$\phi_k(u, t_i) = \frac{1}{2\sigma_{jk}^2} \left( y_k - A_j - B_j e^{-C_j (u + t_i - t_k)} \right)^2 \qquad (26)$$
An essential element in both the parameter estimation process and the determination of p(Mj | ℑi), see equations (6)-(9), is the distribution p(yi | ℑi-1, Mj) given by equation (8). For the functional forms used in this example, we have
$$p(y_i \mid \Im_{i-1}, M_j) = \frac{\displaystyle \int_0^{\infty} \frac{1}{\sigma_{ji}\sqrt{2\pi}}\, (x_i + t_i)^{\beta_j - 1} \exp\left\{ -\alpha_j (x_i + t_i) - \sum_{k=1}^{i} \phi_k(x_i, t_i) \right\} dx_i}{\displaystyle \int_{t_i - t_{i-1}}^{\infty} (u + t_{i-1})^{\beta_j - 1} \exp\left\{ -\alpha_j (u + t_{i-1}) - \sum_{k=1}^{i-1} \phi_k(u, t_{i-1}) \right\} du} \qquad (27)$$
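Equation (25) is straightforward to evaluate numerically. The sketch below does so on a residual-life grid with trapezoidal normalisation; the grid, the log-space stabilisation and the function names are our own illustrative choices:

```python
# Sketch evaluating the closed-form filter of equation (25) with the phi_k
# of equation (26). ts, ys are monitoring times t_1..t_i and readings
# y_1..y_i; the remaining parameters follow table 1.
import numpy as np

def phi_sum(u, t_i, ts, ys, A, B, C, d):
    """Sum over k = 1..i of phi_k(u, t_i), equation (26)."""
    ts, ys = np.asarray(ts), np.asarray(ys)
    mu = A + B * np.exp(-C * (u[:, None] + t_i - ts[None, :]))
    sigma = d * ys[None, :]                  # sigma_jk = d_j * y_k
    return (((ys[None, :] - mu) / sigma) ** 2).sum(axis=1) / 2.0

def posterior_rul(x, t_i, ts, ys, alpha, beta, A, B, C, d):
    """p_ji(x_i | F_i, M_j) of equation (25) on the grid x."""
    logp = (beta - 1.0) * np.log(x + t_i) - alpha * (x + t_i) \
        - phi_sum(x, t_i, ts, ys, A, B, C, d)
    p = np.exp(logp - logp.max())            # stabilise before normalising
    return p / np.trapz(p, x)
```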
Failure times for the components are simulated using inversion on the initial life distribution p(x0), and the vibration readings are then generated at each CM point using inversion on the conditional density p(yi | xi). We now simulate a case corresponding to each of the two failure modes and demonstrate the ability of the proposed methodology to track the appropriate mode and the underlying residual life. We compare the estimates of residual life and the prediction errors obtained using the combined weighted modelling approach (filters 1 and 2) with those obtained using the general stochastic filter 3 at each simulated CM point. The prediction error at the ith CM point is
$$e_i = \left( (x_i - E[x_i \mid \Im_i])^2 \right)^{1/2} \qquad (28)$$
The mean-square error (MSE) about the simulated failure time is used as a criterion for comparing the weighted
and general filters. Considering the weighted approach, the MSE attributable to each of the contributing filters
is weighted according to the probability that each model provides an appropriate representation of the
underlying dynamics for the particular component.
5.1 Case 1
For this first case, a cycle of CM data is simulated with the underlying dynamics corresponding to failure mode 1. The failure time for the cycle is 193 hours and figure 3 demonstrates the ability of the recursive process to track the appropriate mode according to equations (18)-(21), using equation (27) developed for this specific case.
Figure 3. Illustrating the tracking of failure mode 1 for case 1
Figure 4. Comparing the residual life predictions obtained using the weighted approach (filters 1 and 2) and filter 3 for case 1 (residual life (hrs) against CM time (hrs); series: actual, weighted, general)
Figure 4 illustrates the tracking of the residual life at CM points throughout the life of the component. We compare the estimates of residual life given by the combined weighted modelling approach proposed in section 3 and the general filter.
Figure 3 clearly illustrates that the methodology tracks the appropriate failure mode for this particular case, and figure 4 demonstrates a clear improvement in the residual life prediction capability of the combined modelling approach (filters 1 and 2) when compared with the general filter 3. In addition, the sum of squared errors for the weighted approach is 808.19, compared with 1776.6 for filter 3. The superiority of the combined approach is further supported by the MSE statistic of 345115, compared with 732541 for the general filter.
5.2 Case 2
For this second case, the CM process is simulated according to failure mode 2, with a failure time for the component of 651 hours. Figures 5 and 6 illustrate the tracking of the appropriate mode and the residual life respectively.
Figure 5. Illustrating the tracking of failure mode 2 for case 2
Figure 6. Comparing the residual life predictions obtained using the weighted approach (filters 1 and 2) and filter 3 for case 2 (residual life (hrs) against CM time (hrs); series: actual, weighted, general)
As with the first case, it is clear from figures 5 and 6 that the weighted approach (filters 1 and 2) quickly tracks the appropriate failure mode, and that the estimates of the residual life are more accurate than those obtained using the general filter 3. This conclusion is again confirmed by the fit statistics: the sum of squared errors is 1422.9 for the weighted approach and 2234.3 for the general filter, and the MSE is 585240 for the weighted approach and 1050250 for the general filter.
6. Discussion
Cases 1 and 2 in the example have demonstrated that, in some situations, it may be advantageous to group the available CM histories and construct a number of probabilistic stochastic filters to represent the specified contingencies (failure modes/types). The filters are then applied in parallel to new component CM information, and the output from each filter is weighted according to the recursively derived conditional probability that the filter is the appropriate representation of the current component's underlying dynamics. The model introduced in section 4, incorporating transitions between failure modes, will be explored in future research, and a study is currently being conducted to test the application of the fixed-mode methodology to an actual monitoring scenario.
Acknowledgement
The research documented in this paper has been supported by the Engineering and Physical Sciences Research
Council (EPSRC, UK) under grant number EP/C54658X/1.
References
Cox, D. R. and Oakes, D. (1984) Analysis of Survival Data, Chapman and Hall
Crowder, M. (2001) Classical Competing Risks, Chapman & Hall/CRC
Makis, V. and Jardine, A. K. S. (1991) Optimal replacement policies in the proportional hazards model, INFOR, 30, 172-183
Vlok, P. J., Coetzee, J. L., Banjevic, D., Jardine, A. K. S. and Makis, V. (2002) Optimal component replacement decisions using
vibration monitoring and the Proportional Hazards Model, J. of the Operational Research Society, 53, 193-202
Wang, W. (2002) A model to predict the residual life of rolling element bearings given monitored condition information to date, IMA J.
of Management Mathematics, 13, 3-16
Wang, W. and Christer, A. H. (2000) Towards a general condition based maintenance model for a stochastic dynamic system, J. of the
Operational Research Society, 51, 145-155
Remaining useful life in condition based maintenance: Is it useful?
D. Banjevic, A.K.S. Jardine
CBM Laboratory, Department of Mechanical and Industrial Engineering, University of Toronto, 5 King’s
College Road, Toronto, Ontario, M5S 3G8, Canada
[email protected], [email protected]
Abstract: Remaining useful life (RUL) is nowadays in fashion, both in theory and in applications. Engineers use it mostly when they have to decide whether to do maintenance or to delay it due to production requirements. Most often, it is assumed that in the later life of equipment (in the wear-out period) the hazard function is increasing, and then the expected RUL, μ(t), is decreasing. It has been noticed that the standard deviation of RUL, σ(t), is also decreasing, which was expected, but that the ratio σ(t)/μ(t) is increasing, which was a surprise. Initiated by this observation, we have proved that under some general conditions, which include the Weibull distribution with shape parameter > 1, this is indeed the case. Even more, we have proved that the limiting distribution of the standardized RUL is exponential, so that the variability of RUL is relatively large. We may conclude from this that in later life the point prediction of RUL is relatively inaccurate and may not be very useful.
1. Introduction
In the modern industrial environment it is common to monitor periodically, or even continuously, the life and state of operating equipment, particularly if its lifetime (time to failure) is a random variable and cannot be predicted with certainty. Condition monitoring can provide information on the current working age and state of the system, measured by some diagnostic variables, and also on environmental conditions that may affect its future life. This information can then be used for prediction of the remaining useful life of the system and for planning of maintenance activities. Let T be the time to failure of the system, and suppose the system has survived until time t. Then the "conditional" random variable $X_t = T - t$ (defined when T > t), i.e., the remaining time to failure, is called the "remaining useful life" (RUL) of the system. The conditional reliability function $R_t(x) = P_t(X_t > x) = P(T - t > x \mid T > t)$ contains all the information required for prediction and planning of future activities depending on RUL. For example, a maintenance decision policy depending on risk may be to stop operation and do preventive maintenance at the first moment t when, for fixed x (e.g., the length of a regular inspection interval), the probability of failure before x, $F_t(x) = 1 - R_t(x)$, exceeds a certain predetermined level. Another method may be to calculate (estimate) the expected residual life, often called the mean residual life (MRL), and use it either as a point estimate of RUL or to create a prediction interval for RUL. Obviously, the MRL value itself, even if correct, may not be very useful due to the variation of RUL. On the other hand, if a failure should be prevented, then the system should be stopped safely before the MRL. Regardless of how the MRL is used, as a function of t it is an important practical and theoretical quantity that describes the aging of the equipment. The MRL function is closely related to the widely used hazard rate function h(t); this relationship will be considered in detail later. Muth (1977) suggests that the MRL is more informative and useful than the hazard function. That may be the case, but h(t) is so rooted in engineers' psyche that it will be difficult to replace it with something else, even if the alternative is more useful. A convenient relationship between μ(t) and h(t) is then of interest.
2. Definitions and basic properties
Consider a nonnegative random variable T which represents the random time to failure of an item. Let $R(t) = P(T > t)$, $t \geq 0$, be its reliability function. For simplicity let $R(t) > 0$ for all t, and let T be absolutely continuous, so that its density function f(t) and its hazard function h(t) exist. Let also $H(t) = \int_0^t h(x)\,dx$ be the cumulative hazard function. Let $X_t = T - t$ be the remaining useful life of T at t, which is defined if T > t, and let $R_t(x) = P_t(X_t > x) = P(T - t > x \mid T > t)$, $x \geq 0$, be its reliability function, $h_t(x)$ its hazard function and $f_t(x)$ its density function. Then
$$R_t(x) = \frac{R(t+x)}{R(t)}, \qquad f_t(x) = -\frac{\partial}{\partial x} R_t(x) = \frac{f(t+x)}{R(t)} = h(t+x) R_t(x),$$
$$h_t(x) = \frac{f_t(x)}{R_t(x)} = h(t+x), \qquad H_t(x) = \int_0^x h_t(s)\,ds = \int_t^{t+x} h(s)\,ds = H(t+x) - H(t).$$
The mean residual life function (MRLF) is defined as
$$\mu(t) = E_t X_t = E(T - t \mid T > t) = \int_0^{\infty} R_t(x)\,dx.$$
Then $\mu(0) = \mu = ET$, and it is easy to see that the MRLF is defined if and only if $\mu < \infty$, which we assume in the following. Let us also note that $\mu_t(x) = \mu(t+x)$. One can easily see from $\mu(t) = \int_t^{\infty} R(x)\,dx / R(t)$ that
$$h(t) = -\frac{R'(t)}{R(t)} = \frac{\mu'(t) + 1}{\mu(t)}.$$
This means that $\mu(t)$ cannot be an arbitrary positive function, but is subject to the slight restrictions that $\mu'(t) \geq -1$ (from $h(t) \geq 0$) and $\int_0^{\infty} 1/\mu(x)\,dx = \infty$. On the other side, the restriction on h(t) is that $\int_0^{\infty} h(x)\,dx = \infty$. It is interesting to note that $\mu = \mu(0) = E(1/h(T))$; so, the mean time to failure is the average of the reciprocal hazard. A more detailed relationship will be considered in the following. For other details on RUL see Guess & Proschan (1988) and Reinertsen (1996).
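The identity μ = E(1/h(T)) is easy to check by simulation. A quick Monte Carlo sketch, assuming a standard Weibull failure time (θ = 1, shape β = 3), for which both printed values should approximate Γ(4/3) ≈ 0.893:

```python
# Monte Carlo check of mu = E(1/h(T)) for a standard Weibull failure time
# (theta = 1, beta = 3); both printed values approximate Gamma(4/3) ~ 0.893.
import numpy as np

rng = np.random.default_rng(0)
beta = 3.0
T = rng.weibull(beta, size=1_000_000)        # simulated failure times

hazard = lambda t: beta * t ** (beta - 1.0)  # h(t) = beta t^(beta-1)
print(T.mean(), (1.0 / hazard(T)).mean())    # mean life vs mean reciprocal hazard
```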
Our interest is in deteriorating equipment, and from now on we will assume that the hazard function h(t) is increasing (nondecreasing) for $t \geq t_0$, for some $t_0 \geq 0$. Note that all results in the following that involve t are valid for $t \geq t_0$; for simplicity we will assume that $t_0 = 0$. In practice the case of deterioration ("wear-out") is considered for more expensive and complicated equipment which is maintained regularly and fixed when it fails (most often a "minimal repair" which does not change the overall deteriorating trend of the unit). The goal is then to utilize the equipment as much as possible, until the end of its designed life, while still trying to avoid a catastrophic, nonrepairable failure. This situation may explain the interest in the "remaining useful life", its distribution and expectation. In general, an IFR (increasing failure rate) distribution can be defined by the property that $R_t(x)$ decreases in t for each fixed x, that is, the reliability over any fixed interval decreases with age. It is easy to see that
$$\frac{\partial}{\partial t} R_t(x) = -R_t(x)\,(h(t+x) - h(t)),$$
which means that a distribution is IFR if and only if h(t) is increasing. From the definition $\mu(t) = \int_0^{\infty} R_t(x)\,dx$ and the property that $R_t(x)$ decreases in t, it follows that $\mu(t)$ also decreases in t when the distribution is IFR. The opposite is not true, as Muth (1977) shows by a counterexample.
3. Properties of the variance of the remaining life
Our main interest here is to investigate the behavior of the variance of RUL, that is,
$$\sigma^2(t) = Var(X_t) = E[(T - t - \mu(t))^2 \mid T > t] = E[(T - t)^2 \mid T > t] - \mu^2(t).$$
Let the function $g(x)$ be such that $Eg(T)$ exists. Then $Eg(T) = \int_0^{\infty} g(x)\,dF(x) = g(0) + \int_0^{\infty} g'(x) R(x)\,dx$, and then
$$\sigma^2(t) = 2\int_0^{\infty} x R_t(x)\,dx - \mu^2(t).$$
Lemma 1. Let h(t) be increasing. Then
(a) $\sigma^2(t)$ is decreasing,
(b) $\sigma^2(t) \leq \mu^2(t)$.
Proof: We will first prove that
$$\int_0^{\infty} x R_t(x)\,dx \leq \mu^2(t), \qquad (*)$$
from which both (a) and (b) will follow. For $s \geq t$, $\mu(t) \geq \mu(s) = \int_s^{\infty} R(x)\,dx / R(s)$, or $\mu(t) R(s) \geq \int_s^{\infty} R(x)\,dx$, and
$$\mu(t) \int_t^{\infty} R(s)\,ds \geq \int_t^{\infty} ds \int_s^{\infty} R(x)\,dx = \int_t^{\infty} dx\, R(x) \int_t^{x} ds = \int_t^{\infty} (x - t) R(x)\,dx = \int_0^{\infty} x R(t + x)\,dx.$$
Noting that $\int_t^{\infty} R(s)\,ds = \mu(t) R(t)$ and dividing both sides of the previous inequality by $R(t)$, we have (*). Then for (b), $\sigma^2(t) \leq \mu^2(t)$ is obviously equivalent to (*), since $\sigma^2(t) = 2\int_0^{\infty} x R_t(x)\,dx - \mu^2(t)$. For (a), consider
$$(\sigma^2(t))' = 2\int_0^{\infty} x \tfrac{\partial}{\partial t} R_t(x)\,dx - 2\mu(t)\mu'(t) \leq 0, \quad \text{or} \quad -\int_0^{\infty} x [h(t+x) - h(t)] R_t(x)\,dx \leq \mu(t)\mu'(t) = \mu(t)(\mu(t)h(t) - 1).$$
Then $-\int_0^{\infty} x h_t(x) R_t(x)\,dx + h(t)\int_0^{\infty} x R_t(x)\,dx \leq \mu^2(t)h(t) - \mu(t)$, or $-\mu(t) + h(t)\int_0^{\infty} x R_t(x)\,dx \leq \mu^2(t)h(t) - \mu(t)$, which is (*).
The inequality (*) can be generalized to a very useful general inequality.
Lemma 2. Let $u(x) \geq 0$, $g(x) \geq 0$, and $(u * g)(x) = \int_0^x u(s) g(x - s)\,ds$. Then
$$\int_0^{\infty} u(x) R_t(x)\,dx \cdot \int_0^{\infty} g(x) R_t(x)\,dx \;\geq\; \int_0^{\infty} (u * g)(x) R_t(x)\,dx.$$
Proof: Similar to Lemma 1.
E.g., if we put $u \equiv g \equiv 1$, then $(u * g)(x) = x$, and Lemma 1 follows.
The following identities are useful in deriving some results (see Bradley & Gupta (2003)). Let the function $g(x)$ be such that $g(x)\mu_t(x) \to 0$, $x \to \infty$. Then
$$\text{(i)} \quad \int_0^{\infty} g(x) R_t(x)\,dx = g(0)\mu(t) + \int_0^{\infty} g'(x)\mu_t(x) R_t(x)\,dx,$$
$$\text{(ii)} \quad \int_0^{\infty} g(x) R_t(x)\,dx = \frac{g(0)}{h(t)} + \int_0^{\infty} \left[ \frac{g(x)}{h_t(x)} \right]' R_t(x)\,dx.$$
Identities (i) and (ii) are easily obtained by partial integration. For $g(x) = x$ in (i) we get $\int_0^{\infty} x R_t(x)\,dx = \int_0^{\infty} \mu_t(x) R_t(x)\,dx$. From $\mu_t(x) = \mu(t+x) \leq \mu(t)$, we have
$$\int_0^{\infty} \mu_t(x) R_t(x)\,dx \leq \mu(t)\int_0^{\infty} R_t(x)\,dx = \mu^2(t),$$
which is (*) in Lemma 1.
If we put $g(x) = \mu_t(x)$ in (i), we get
$$\int_0^{\infty} \mu_t(x) R_t(x)\,dx = \mu_t(0)\mu(t) + \int_0^{\infty} \mu_t'(x)\mu_t(x) R_t(x)\,dx = \mu^2(t) + \int_0^{\infty} (h_t(x)\mu_t(x) - 1)\mu_t(x) R_t(x)\,dx$$
$$= \mu^2(t) + \int_0^{\infty} h_t(x)\mu_t^2(x) R_t(x)\,dx - \int_0^{\infty} \mu_t(x) R_t(x)\,dx,$$
or
$$2\int_0^{\infty} \mu_t(x) R_t(x)\,dx = \mu^2(t) + \int_0^{\infty} h_t(x)\mu_t^2(x) R_t(x)\,dx. \qquad (**)$$
In Lemma 1 we proved that $\sigma^2(t) \leq \mu^2(t)$, or $\sigma^2(t)/\mu^2(t) \leq 1$. We have also proved that both $\sigma^2(t)$ and $\mu^2(t)$ are decreasing. We are now interested in the behavior of $\sigma^2(t)/\mu^2(t)$. We will show that under broad conditions, which include the Weibull distribution with shape parameter greater than 1, $\sigma^2(t)/\mu^2(t) \uparrow 1$. We first need a technical result.
Lemma 3.
(a) If $h(t)\mu(t)$ is increasing, then $h(t)\mu(t) \to 1$, and $\int_0^{\infty} x R_t(x)\,dx \cdot (2 - h(t)\mu(t)) \geq \mu^2(t)$.
(b) If $1/h(t)$ is a convex function (i.e. $(1/h(t))' \uparrow$), then $h(t)\mu(t)$ is increasing.
Proof: If $h(t)\mu(t)$ is increasing, then $\mu'(t) = h(t)\mu(t) - 1$ is also increasing, i.e. $\mu(t)$ is convex. We will show that $\mu'(t) \uparrow 0$, so that $h(t)\mu(t) \uparrow 1$. From $\mu(t) \downarrow$, $\mu'(t) \leq 0$, that is $h(t)\mu(t) \leq 1$. From the convexity of $\mu(t)$ it follows that for $y > t$,
$$\mu(t) \geq \mu(y) + \mu'(y)(t - y) = \mu(y) + |\mu'(y)|(y - t) \geq |\mu'(y)|(y - t).$$
Then $\mu(t) \geq \limsup_{y \to \infty} |\mu'(y)|(y - t)$, so it must be that $\limsup_{y \to \infty} |\mu'(y)| = \lim_{y \to \infty} |\mu'(y)| = 0$. From the proof it also follows that $\mu'(y) = o(1/y)$, $y \to \infty$. From $h_t(x)\mu_t(x) = h(t+x)\mu(t+x) \geq h(t)\mu(t)$ and from (**),
$$2\int_0^{\infty} \mu_t(x) R_t(x)\,dx \geq \mu^2(t) + h(t)\mu(t) \int_0^{\infty} \mu_t(x) R_t(x)\,dx,$$
and (a) follows. For (b), we need $(h(t)\mu(t))' \geq 0$, or $h'\mu + h\mu' = h'\mu + h(h\mu - 1) = (h' + h^2)\mu - h \geq 0$, and, using $h' \geq 0$,
$$\mu \geq \frac{h}{h^2 + h'} = \frac{1}{h} \cdot \frac{1}{1 + h'/h^2} = \frac{1}{h} \cdot \frac{1}{1 - (1/h)'}.$$
If we use $g(x) \equiv 1$ in identity (ii), we get
$$\mu(t) = \int_0^{\infty} R_t(x)\,dx = \frac{1}{h(t)} + \int_0^{\infty} \left( \frac{1}{h_t(x)} \right)' R_t(x)\,dx.$$
If $1/h \downarrow$, then $(1/h)' < 0$. If $(1/h)' \uparrow$, then $0 > (1/h_t(x))' > (1/h(t))'$, and
$$\mu(t) \geq \frac{1}{h(t)} + \left( \frac{1}{h(t)} \right)' \int_0^{\infty} R_t(x)\,dx = \frac{1}{h(t)} + \left( \frac{1}{h(t)} \right)' \mu(t), \quad \text{or} \quad \mu \geq \frac{1}{h} \cdot \frac{1}{1 - (1/h)'},$$
as required. So, if $1/h(t)$ is convex, then
$$\frac{1}{h(t)} \cdot \frac{1}{1 - (1/h(t))'} \leq \mu(t) \leq \frac{1}{h(t)}.$$
Note that also $(1/h)' \uparrow 0$. Let us denote $s(t) = 1/h(t)$. Then $s(t)$ is decreasing and $0 \geq s' \uparrow$. From the convexity of $s(t)$ it follows that $s(t) \geq s(y) + s'(y)(t - y) = s(y) + |s'(y)|(y - t) \geq |s'(y)|(y - t)$. Then, as above for $\mu'(t)$, it follows that $s'(t) \to 0$. From the proof it also follows that $s'(y) = o(1/y)$, $y \to \infty$. For a more general result, see Bradley & Gupta (2003, Theorem 4).
Theorem 1. If $1/h(t)$ is a convex function, that is $(1/h)' \uparrow$, then $\sigma^2(t)/\mu^2(t) \uparrow 1$.
Proof: From Lemma 3, $h(t)\mu(t) \uparrow 1$. Let $G_1(t) = \int_0^{\infty} x R_t(x)\,dx$. Then
$$\sigma^2(t)/\mu^2(t) = (2G_1(t) - \mu^2(t))/\mu^2(t) = 2G_1(t)/\mu^2(t) - 1.$$
We then have to prove that $v(t) = G_1(t)/\mu^2(t)$ is increasing, or $v'(t) \geq 0$, or $G_1'\mu - 2G_1\mu' \geq 0$, or
$$\mu(t)\int_0^{\infty} x \tfrac{\partial}{\partial t} R_t(x)\,dx - 2G_1(t)(h(t)\mu(t) - 1) \geq 0, \quad \text{or}$$
$$-\mu(t)\int_0^{\infty} x [h_t(x) - h(t)] R_t(x)\,dx - 2G_1(t)(h(t)\mu(t) - 1) \geq 0, \quad \text{or}$$
$$-\mu(t)(\mu(t) - h(t)G_1(t)) - 2G_1(t)(h(t)\mu(t) - 1) \geq 0, \quad \text{or} \quad G_1(t)(2 - h(t)\mu(t)) \geq \mu^2(t).$$
The monotonicity then follows from Lemma 3. Also, $1 \geq v(t) = G_1(t)/\mu^2(t) \geq 1/(2 - h(t)\mu(t)) \to 1$, $t \to \infty$. Then also $\sigma^2(t)/\mu^2(t) = 2v(t) - 1 \uparrow 1$.
Example 1. For the Weibull distribution, $h(t) = \beta (t/\theta)^{\beta - 1}$, so
$$s(t) = \frac{1}{\beta}\left( \frac{t}{\theta} \right)^{1-\beta}, \quad s'(t) = \frac{1 - \beta}{\beta\theta}\left( \frac{t}{\theta} \right)^{-\beta}, \quad s''(t) = \frac{\beta - 1}{\theta^2}\left( \frac{t}{\theta} \right)^{-\beta - 1}.$$
Assume that $\beta > 1$, so that $s(t)$ is convex, and $0 > s'(t) \uparrow 0$, and then $\sigma^2(t)/\mu^2(t) \uparrow 1$. This result would be quite difficult to prove directly, using special properties of the Gamma function.
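Theorem 1 is also easy to verify numerically in this Weibull case. An illustrative sketch (the quadrature routine and the grid of ages are arbitrary choices of ours) showing σ²(t)/μ²(t) climbing towards 1:

```python
# Numerical illustration of Theorem 1 for the Weibull case of Example 1
# (theta = 1, beta = 3): sigma^2(t)/mu^2(t) increases towards 1.
import numpy as np
from scipy.integrate import quad

beta = 3.0
R = lambda t: np.exp(-t ** beta)              # reliability function

def ratio(t):
    mu = quad(lambda x: R(t + x) / R(t), 0.0, np.inf)[0]        # mu(t)
    m2 = quad(lambda x: 2.0 * x * R(t + x) / R(t), 0.0, np.inf)[0]
    return (m2 - mu ** 2) / mu ** 2           # sigma^2(t) / mu^2(t)

for t in [0.0, 0.5, 1.0, 2.0, 4.0]:
    print(t, ratio(t))                        # ~0.13 at t = 0, rising to ~1
```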
4. Limiting distribution of remaining useful life
It is of interest to investigate the limiting distribution of $X_t$, properly normalized. From the properties $h(t)\mu(t) \uparrow 1$ and $\sigma^2(t)/\mu^2(t) \uparrow 1$, the quantities
$$P\left( \frac{T - t}{\mu(t)} > x \,\Big|\, T > t \right), \quad P\left( \frac{T - t}{\sigma(t)} > x \,\Big|\, T > t \right), \quad \text{and} \quad P(h(t)(T - t) > x \mid T > t) = P\left( \frac{T - t}{s(t)} > x \,\Big|\, T > t \right)$$
should have the same limit when $t \to \infty$, if the limit exists. In applications it would be simplest to use the last form.
Theorem 2. Under the assumptions for which $\sigma^2(t)/\mu^2(t) \uparrow 1$, that is $(1/h)' \uparrow 0$,
$$P\left( \frac{T - t}{s(t)} > x \,\Big|\, T > t \right) \to e^{-x}, \quad t \to \infty.$$
Proof:
$$P\left( \frac{T - t}{s(t)} > x \,\Big|\, T > t \right) = P(T - t > x s(t) \mid T > t) = R_t(x s(t)) = \exp\{-[H(t + x s(t)) - H(t)]\} = \exp\left\{ -\int_t^{t + x s(t)} h(u)\,du \right\}.$$
We have to prove that $\int_t^{t + x s(t)} h(u)\,du \to x$ when $t \to \infty$. From $h(t) \uparrow$,
$$\int_t^{t + x s(t)} h(u)\,du \geq h(t) \int_t^{t + x s(t)} du = h(t) s(t) x = x.$$
Due to the convexity of $s(t)$, $s(t + t_0) \geq s(t) + s'(t)\,t_0$ for all $t, t_0 \geq 0$; taking $t_0 = x s(t)$ gives $s(t + x s(t)) \geq s(t) + s'(t)\, x s(t) = s(t)(1 + x s'(t))$. Then
$$\int_t^{t + x s(t)} h(u)\,du \leq h(t + s(t) x)\, s(t) x = \frac{s(t)\, x}{s(t + s(t) x)} \leq \frac{x}{1 + x s'(t)} \to x,$$
because $s'(t) \to 0$ when $t \to \infty$, which follows from the assumption that $(1/h)' \uparrow$, as shown in Lemma 3. From the proof of the theorem, bounds for the conditional distribution follow:
$$e^{-x/(1 - x|s'(t)|)} \leq P\left( \frac{T - t}{s(t)} > x \,\Big|\, T > t \right) \leq e^{-x}. \qquad (***)$$
Example 2: For the case of the Weibull distribution with $h(t) = \beta t^{\beta - 1}$ (taking $\theta = 1$), inequality (***) can be slightly improved. Let, for simplicity, $\beta \geq 2$, so that $h(t)$ is convex (a similar derivation can be obtained when $1 < \beta < 2$, that is, when $h(t)$ is concave). Consider also directly $X_t = T - t$. Then
$$\int_t^{t+x} h(u)\,du \geq \int_t^{t+x} (h(t) + h'(t)(u - t))\,du = \int_0^x (h(t) + h'(t)u)\,du = h(t)x + h'(t)\frac{x^2}{2} = x h(t)\left( 1 + \tfrac{1}{2} x \frac{h'(t)}{h(t)} \right).$$
In a similar way,
$$\int_t^{t+x} h(u)\,du \leq \tfrac{1}{2} x (h(t) + h(t+x)) = x h(t)\left( 1 + \tfrac{1}{2} x \frac{h(t+x) - h(t)}{h(t)\,x} \right) \leq x h(t)\left( 1 + \tfrac{1}{2} x \frac{h'(t+x)}{h(t)} \right),$$
i.e.
$$\exp\left\{ -x\beta t^{\beta-1}\left( 1 + \tfrac{1}{2} x \tfrac{\beta - 1}{t}\left( 1 + \tfrac{x}{t} \right)^{\beta - 2} \right) \right\} \leq P(T - t > x \mid T > t) \leq \exp\left\{ -x\beta t^{\beta-1}\left( 1 + \tfrac{1}{2} x \tfrac{\beta - 1}{t} \right) \right\}.$$
Let $\beta = 3$ and $t = 1$ (i.e., somewhere around the average life); then
$$\exp\{-3x(1 + x(1 + x))\} \leq P(T - t > x \mid T > t) \leq \exp\{-3x(1 + x)\}.$$
For example, the probability that the unit will survive one more average life (x = 1) is not greater than $\exp(-3 \times 2) = 0.0025 = 0.25\%$, and the probability that it will survive at least half of the average life (x = 0.5) is not greater than $\exp(-3 \times 0.5 \times 1.5) = 0.105 = 10.5\%$, but is at least 7.2% (from the lower bound). For larger t the bounds are more accurate. From $\mu(t) \leq 1/h(t) = 1/(\beta t^{\beta-1})$ and $t = 1$, the MRL is not greater than 1/3.
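The quoted figures follow directly from the displayed bounds; a trivial sketch making them reproducible:

```python
# Verifying the Example 2 figures (beta = 3, t = 1) from the displayed
# bounds; included only to make the quoted percentages reproducible.
import math

upper = lambda x: math.exp(-3 * x * (1 + x))            # upper bound on P
lower = lambda x: math.exp(-3 * x * (1 + x * (1 + x)))  # lower bound on P

print(upper(1.0))   # ~0.0025, i.e. 0.25%: one more average life
print(upper(0.5))   # ~0.105,  i.e. 10.5%: half an average life
print(lower(0.5))   # ~0.072,  i.e. 7.2% lower bound
```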
5. Condition monitoring and remaining useful life
Incorporation of condition information into the "calculation" of RUL is of great importance in current industrial practice, with its extensive condition monitoring and periodic inspections, but it is a much more complicated theoretical and practical problem than with age information only. In practical applications, some longer-term predictions are based on deterministic models, mainly empirical, depending on measures of deterioration. Results of regular condition monitoring, such as from oil or vibration analysis, are used for short-term predictions, mostly of the risk of failure. For an example of practical MRL estimation see Elsayed (2003). Models such as proportional hazards or accelerated life are often used in the reliability/statistical approach. Most of the models are based on experience and simple trending. Strictly speaking, these models are more empirical than based on a sound theory, due to technical difficulties requiring a probabilistic model for the behavior of covariates. If $Z(t)$ is the covariate information (measurement) available at time t (which may also include all past information), the conditional reliability function is $R_t(x \mid Z(t)) = P(T - t > x \mid T > t, Z(t))$. It requires a description of the joint distribution of T and $Z(t)$, which is a much more difficult problem than just a model for T. Usually, there is only an indirect relationship between $Z(t)$ and T, often with a lot of noise and irrelevant information in $Z(t)$, which makes the problem even more difficult. The mean residual life function can then be defined as
$$\mu(t, Z(t)) = E(T - t \mid T > t, Z(t)).$$
Very little is devoted to this function in the literature (see, e.g., Maguluri & Zhang (1994), Kemp (2000), and Yuen et al. (2003), where the covariate vector is time independent). A discrete Markov process model for $Z(t)$ is considered in Banjevic & Jardine (2006), with application to transmission oil analysis data. Just as the function $\mu(t)$ cannot be selected arbitrarily, but is subject to certain restrictions, the function $\mu(t, Z(t))$ is subject to even more restrictions, depending on the stochastic behavior of $Z(t)$. As is also pointed out by Jewell & Nielsen (1993), an ad hoc model for $\mu(t, Z(t))$ (such as one of regression type, $\exp\{\gamma' Z(t)\}$) cannot be formally used, unless intended for a fixed t, because it may violate the "consistency condition" for $R_t(x \mid Z(t))$. For some MRL models see also Sen (2004), Muller & Zhang (2005), and Chen et al. (2005). In a dynamic industrial environment, models with covariates would be preferable, but they are less used, due to complexity and the requirement for regular storage and retrieval of the condition information. Decisions are often made ad hoc, using experience, for short-term "emergency alarm" decisions to stop the operation. With degradation variables that show slow development, the prediction of RUL is easier, and it is used for the planning of maintenance.
Let us finish with the words of an engineer whom we asked about RUL: "RUL – it is the operating hours left on equipment before it has to be down for major repair. Some RUL is based on Vendor recommendations, some are based on experience (e.g. wear on pumps) and some are based on deterministic analyses (e.g. crack growth). However, the majority is experience-based. Due to changing demands of operation, it is useful to have both predictions (remaining life and risk of failure). At our company, RUL is based on age and "normal" operating conditions. PM's are scheduled based on typical RUL's but may be moved to accommodate either unexpected deteriorating equipment conditions or unplanned production demands. [There are] good examples of maintenance items that [are] based on RUL, [and others that] are good examples of maintenance items that [are] not based on RUL but based on current condition. I guess a good summary of my comments is that when failure modes are known or predictable, RUL becomes critical in scheduling maintenance. When failure is unpredictable due to randomly changing conditions, then RUL becomes meaningless and maintenance decisions are based on current condition." This understanding of RUL may not be strictly as in the theory, but the comments perfectly show the standing of RUL in real-life situations.
Acknowledgment
This research was supported by Materials and Manufacturing Ontario and the Natural Sciences and Engineering Research Council of Canada.
References
Banjevic, D., and Jardine, A. K. S. (2006) Calculation of reliability function and remaining useful life for a Markov failure time process,
IMA J. of Management Mathematics, 17, 115-130
Bradley, D. and Gupta, R. (2003) Limiting behaviour of the mean residual life, Ann. Inst. Statist. Math. 55, 217-226
Chen, Y. Q., Jewell, N. P., Lei, X. and Cheng, S. C. (2005) Semiparametric Estimation of Proportional Mean Residual Life Model in
Presence of Censoring, Biometrics, 61, 170-178
Elsayed, E. (2003) Mean residual Life and Optimal Operating Conditions for Industrial Furnace Tubes, in Case Studies in Reliability and
Maintenance, ed. Blischke, W. R. and Murthy, D. N. P., Wiley, 497-515
Guess, F. and Proschan, F. (1988) Mean Residual Life: Theory and Applications, In Handbook of Statistics, 7, eds. Krishnaiah, P. R. and
Rao, C. R., Elsevier Science Publishers B.V., 215-224
Jewell, N. P., and Nielsen, J. P. (1993) A Framework for Consistent Prediction Rules Based on Markers, Biometrika, 80, 153-164
Kemp, G. C. R. (2000) When is a proportional hazards model valid for both stock and flow sampled duration data? Economics Letters,
69, 33-37
Maguluri, M. and Zhang, C. (1994) Estimation in the Mean Residual Life Regression Model, J. R. Statist. Soc. B, 56, 477-489
Muller, H. and Zhang, Y. (2005) Time-Varying Functional Regression for Predicting Remaining Lifetime Distributions from
Longitudinal Trajectories, Biometrics, 61, 1064-1075
Muth, E. (1977) Reliability models with positive memory derived from the mean residual life function, in The Theory and Applications
of Reliability, 2, ed. Tsokos, C. P. and Shimi, I. N., Academic Press, New York, 401-435
Reinertsen, R. (1996) Residual life of technical systems; diagnosis, prediction and life extension, Reliability Engineering and System
Safety, 54, 23-34
Sen, P. K. (2004) HRQoL and Concomitant Adjusted Mean Residual Life Analysis, in Parametric and Semiparametric Models with
Applications to Reliability, Survival Analysis and Quality of Life, eds. Nikulin, M. S., Balakrishnan, N., Mesbah, M. and Limnios,
N., Birkhauser, Boston, 349-362
Yuen, K. C., Zhu, L. X. and Tang, N. Y. (2003) On the mean residual life regression model, J. of Statistical Planning and Inference, 113,
685-698
Demand categorisation in a European spare parts logistics network
Aris A. Syntetos 1, M. Zied Babai 2, Mark Keyes 3
1, 2. Centre for Operational Research and Applied Statistics, Salford Business School, University of Salford,
Maxwell Building, The Crescent, Manchester M5 4WT, UK
3. Logistics Information Centre, Brother International UK, Brother House, 1 Tame Street, Audenshaw,
Manchester M34 5JE, UK
[email protected], [email protected], [email protected]
Abstract: Stock keeping units (SKUs) exhibit different demand patterns requiring varying methods for
forecasting and stock control. Software such as SAP requires users to categorise demand patterns before
selecting the technique that optimises the forecast. This planning functionality within SAP is typical of the
industry-standard software packages available. This paper addresses issues related to demand categorisation and its impact on the inventory control and service-level performance of a large business machine manufacturer. In particular we focus on the management of spare parts; the demand patterns exhibited by such SKUs may range from relatively smooth and constant (normal) to intermittent and sporadic (non-normal). By analysing the effectiveness of the company's previous ABC classification and then considering the latest research findings, we can see the impact of utilising these new theories in a large organisation's spare parts management and contrast differences in performance.
Keywords: Spare parts management, demand categorisation, forecasting, stock control
How academics can help industry and the other way around
R. Dwight, H. Martin, J. Sharp 1
1. University of Salford, UK.
[email protected]
Introduction
Many organisations are faced with an unprecedented level of competition. Markets are now truly global, and countries such as those of the former Eastern Bloc and China are participating in this global market with growing success. Former developing economies, such as India and the countries of South East Asia, are also growing in their own right.
A growing and accessible global market means not just increased business potential; it also requires much closer attention to innovation and to creating effective and efficient business operations. For many organisations, this means that "management-by-gut-feeling" may not be enough to be ready for this global challenge. Business processes need to be analysed, and possibly redesigned, in a systematic and justifiable manner. Management by trial and error may not give an organisation a second chance.
Viewed from this perspective, academic research aimed at the improvement of business processes may prove indispensable in supporting modern management. Several academic disciplines, such as industrial engineering, organisational sociology and psychology, operations research, etc., are active in this particular area, but mostly from a pure academic interest.
On the other hand, academics have a basic responsibility to develop new and innovative pieces of knowledge, i.e. to produce scientifically relevant knowledge. Essentially, the quality, and therefore the success, of a researcher is very much dependent on the recognition of the quality of the work by peers, mainly through the medium of publications. Creating as many publications in as high-ranking journals as possible is a top priority of a researcher. In addition, a moral claim exists from society, which finances most of the research directly or indirectly, that this new knowledge should have value through its application, i.e. that it should produce societally relevant knowledge. This second type of relevance is actually much less clear and recognised, for various reasons we will discuss later in the paper. We argue that not recognising the second type of responsibility and focusing solely on the first will in the end create a self-fulfilling prophecy in which academic success is measured by the degree to which a researcher complies with the academic establishment. Research goals are mainly determined by a combination of pure academic curiosity and compliance with existing expectations set by peers. Usefulness and applicability do not seem to be a priority.
In this paper we discuss some key issues concerning the potential academic contribution to business process improvement and development in general, and to maintenance processes in particular. The authors have had hands-on experience with both maintenance management in industry and academia. We hope that this creates a somewhat unique perspective and helps us to analyse the so-called "gap" between industry and academics (e.g. see Scarf (2004)) from both sides of the fence. In doing so, we deliberately take a more extreme position to sharpen our point and to provoke discussion. We realise that this paper doesn't provide definitive answers, but if we can fuel the debate constructively, then we consider this paper a success.
First, we will present an example of the gap between academics and practitioners to set the scene. We then discuss and analyse some core issues that need to be resolved to close the gap, and finally we conclude by raising a key question.
Academics and industry: the nature of the problem
Christer et al. (1995) discussed a case study where a company would not increase the time given over to maintenance. Christer et al. (1995) were able to demonstrate that a considerable improvement in plant productivity would be achieved following an application of Delay Time modelling. This showed that, contrary to the current belief within the company, increasing the time given over to maintenance actually increased the available production time. The predicted increase was then achieved in practice. However, once the senior company contact that the modellers had worked with left, there was no modelling skill, or even awareness, within the company, which subsequently and unwittingly reverted to the previous, demonstrably inefficient, practice of breakdown maintenance, thus showing that the company did not learn.
Different goal setting
Given the discussion in the introductory section, it is rather obvious that academics and industrialists have very different main goals, which explains current behaviour quite adequately. Both groups need to be result-oriented if they want to be successful in their field. The industrialists don't have a strict requirement to explain good results: the results, i.e. a contribution to an organisation's mission, are proof of success in themselves. For researchers, on the other hand, the academic ethos usually demands methodological rigour, which usually entails some consistency between a theoretical prediction, a test and a conclusion. The level of consistency determines the quality of the theory rather than the prediction itself.
The net result of this discussion is that, because substantial differences in goal setting exist between industrialists and academics, there is little motivation to develop joint efforts.
Disciplinary versus a problem-oriented approach
Historically, science has evolved dramatically. Although research in accordance with strict methodological
standards has only been conducted during the last 200 years, the body of knowledge has grown so fast that
specialisation quickly became a necessity. Arguably, much of the research is driven by studying phenomena,
which in the end also drives the choices made for specialisation as science advances. In the early days,
mathematicians were considered more or less universally knowledgeable in all types of strict abstracted logic.
Nowadays, the qualification of a mathematician is no longer sufficient to typify someone’s scientific
specialism: a mathematician has specialised in Bayesian statistics, fuzzy logic, fractal modelling, etc.
Interestingly, we may conclude that specialisation has also narrowed the scope in which phenomena are being
researched.
Industrialists in general don’t share the academics’ enthusiasm for phenomena. Instead, they have to deal
with problems on a day-to-day basis. Typically, these problems consist of many phenomena which probably
interact in a certain way. Therefore, the industrialist cannot afford to specialise in the same way as the
researchers do. At best, practitioners specialise in generic types of problems for which they have developed
certain problem-solving skills. All of this usually happens on an individual basis or is driven by dogmatic
“fashion statements” from self-proclaimed management gurus.
Again, we may conclude that the problem-oriented perspective of practitioners doesn’t align very well with the
phenomenological or disciplinary specialisation of academics.
Scientific convenience or real life relevance?
One of the problems researchers face in their endeavour to test their theories is the so-called lack of actual data.
For a phenomenological researcher, it is methodologically imperative to use data about the phenomenon at hand
that is as realistic as possible. Unfortunately, collecting data is usually cumbersome, time consuming and costly.
Researchers interested in phenomena in real organisations may get into trouble in many ways. The data that is
actually required is not readily available in organisations, organisations are not prepared to spend time and effort
in collecting scientifically required data which has no other value from their perspective, organisations enforce a
certain level of secrecy and impose restricted or no access at all to the relevant data, etc. Most researchers do not
have sufficient funds to compensate organisations for their participation. Instead, organisations are expected to
take their moral responsibility to support science. Needless to say, organisations rarely respond to this implicit
responsibility. Also, it is very difficult for both parties to actually explain and understand each other’s potential
stakes in a possible cooperation. Researchers have difficulties in understanding the relationship, if any, between
certain problems in an organisation and how their research may help that particular organisation in the end.
The net result is very often that practitioners carry on without any potentially refreshing input and
researchers are tempted to take shortcuts in their methodology. Because this problem occurs on such a large
scale, taking shortcuts seems to have become almost accepted research policy. For example, researchers use
whatever less valid datasets they can get, which must be treated in a certain way before they can be used, or,
worse, simply make certain assumptions. In recent years, researchers have also turned to elaborate simulations,
assumed to be realistic, as a stand-in for actual situations, which is an equally questionable practice.
Of course, it is difficult to criticise researchers at the very core of their ethos, but every researcher still has
the responsibility to search for the best objective way to test their hypotheses and to consider in what way
corner-cutting will ultimately impair the quality of their work. You, as a responsible researcher, will be the judge
of that.
What happened with Quality management?
At the end of the Second World War, Deming (1986) and others (Crosby (1979) and Juran (1989)) revolutionised
management thinking by introducing the concept of quality. In particular, Japanese organisations embraced the
quality concept fully and implemented quality principles without compromise. Many attribute the Japanese
success on international markets to the rigour and dedication of the Japanese to the quality principle. At its
essence, quality improvement requires people to iterate through the quality circle, “plan-do-check-act”, which is
in effect a learning cycle. So, it is relatively safe to assume that improvement requires some sort of learning. But,
that is what researchers do all the time. Yet, industry in particular is struggling with what is nowadays called
organisational learning and what it actually means (Burgoyne (1995)). Apart from the fact that organisational
learning is still in its infancy in scientific terms and that it is far from being a closed case, many organisations
have little affinity with learning. Education is mostly considered a government responsibility. Industry sees
itself predominantly as a consumer acting in a human resource market, in which more or less competent
individuals can be “bought” when needed and rejected if deemed insufficient. The latter belief, in particular,
clashes with organisational learning.
We realise that this line of reasoning is a bit stretched and hard to substantiate, but can managers really
afford to take a bystander’s position when it comes to learning, or do we leave that to the Japanese? Is it
possible that industry can learn (no pun intended) from researchers and gain a little more appreciation for a
methodological approach to address problems, or is the basic quality circle just an empty phrase?
What is the value of research?
Research needs to be financed. Because of all kinds of difficulties in measuring the value of research in terms of
concrete societal benefits, we take the pragmatic approach of indirect measurement. A thorough discussion of
the problems of measuring the value of scientific research exceeds the scope of this paper by far. Besides, plenty
of publications have addressed this problem (Gray (1999)). But, in view of our problem of closing the gap
between academics and industrialists, it would be nice if not only the cost of research can be made visible, but
also the planned revenue. Thinking in terms of cost and revenues touches the basic industrialist’s mindset.
Arguably, not much effort has been made lately to improve this valuation problem. It seems academics have
accepted the current “best practice” of counting publications as the main criterion for success. The government,
being the main financer of research in many regions of the world, has settled for this type of measuring. In fact,
this type of valuation has been institutionalised everywhere and has become the main criterion in most scientific
accreditations.
Strangely, the government doesn’t act as a true financer and doesn’t demand value for money. At least the
government could remind researchers about their moral responsibility we briefly introduced as the second type
of responsibility in the introduction. Many ambitious researchers are a bit reluctant to make any statements on
this point unless explicitly asked, e.g. when filling in an application for a research grant. Perhaps some researchers
feel that any bonding with practicality would lead them to be classified as “applied researchers”, which in turn,
would drive them away from reaching the Olympic temple of science. We admit that this is a bit exaggerated,
and several government-funded research projects do require attention to the applicability of research. But still,
we feel this is not truly recognised as important in today’s scientific world. Applied research is by nature much
more problem oriented and could potentially alleviate the problem of valuing the applicability of envisioned
research output. Yet, applied research is generally not seen as a scientific career booster.
What about self esteem?
Apart from the difficulty of expressing the value of research explicitly, a researcher in discussions with
industrialists cannot escape the feeling of being regarded as some kind of odd creature. Generally speaking,
researchers are not driven to advance their career and their self-esteem through increases in salary, the
beauty and exuberance of their office, etc. (pun intended). Supposedly, researchers are driven by sheer curiosity
and fancy the respect of their peers. We apologise for picturing this stereotype image of industrialists and
researchers, but we believe that a cultural gap exists between the two groups, which can make it harder to take
one another seriously.
How does maintenance fit into this discussion?
The discussion so far has been quite general and has left maintenance out of the equation. We have deliberately
not addressed maintenance specifically, because the general discussion is very helpful in understanding what
happens in maintenance.
Firstly, maintenance is more problem oriented than (mono-) disciplinary by nature. This creates many of
the problems identified already. Quite a few researchers with a disciplinary research background have no real
understanding of the problem area called maintenance. For example, many papers are published by researchers
who have never seen a real maintenance department, let alone worked in one and come to understand the true
complexity of, and interrelationships between, the highly heterogeneous processes at work. Yet, these researchers claim relevance and
ponder why they are not taken seriously by the practitioners. Understanding maintenance problems requires a
multi-disciplinary approach. This, in turn, calls for applied research without taking shortcuts via oversimplified
and unrealistic assumptions. Metaphorically speaking: just as a biologist must go out into the wild to study the
behaviour of primates, a maintenance researcher must go out and do his studies in real maintenance
environments.
Unfortunately, applied research requires applied researchers, who are on the brink of scientific extinction and
are therefore in short supply.
How can collaboration between academics and industry be improved?
If we look back at all the problems identified and agree on them, it looks next to impossible to change the world.
But in good scientific tradition, we can claim that once we have identified the problem, we can work on a
solution methodologically. And eventually, we have no doubt, we could change the situation. At least that is the
expected rational answer. The real question here is do we really want to change? Most problems relate to
incompatibilities in attitude, which can be changed by sheer willpower. We suggest that industrialists,
academics and the government look at this question first and deal with the details later.
One attempt to close the gap was described by Christer (2005) who, with Sharp, a co-author of this paper,
jointly organised a three-day IFRIM/EPSRC workshop, sponsored by the EPSRC of the UK, in the spring of
2004 to identify the future direction of maintenance research. It was organised as a high-level academic-industrial workshop, inviting senior industrial representatives with key responsibilities for maintenance and
reliability, and recognised academic researchers with a track record of addressing real industrial maintenance
problems. The remit of the workshop was to investigate the gap between maintenance theory and practice, to
consider if it exists, and if so why it exists, consider if the gap is worth closing, and if so what mechanism might
be considered.
The consensus from the workshop was that a gap exists that is worth closing. Perhaps more significantly for
the future, during the workshop the industrial members’ perception of what is possible with modelling and
decision aids changed markedly. There are clear future challenges facing both the academic and industrial
communities, with a payoff to both. The former need to be more closely allied to industry and proactive in
formulating proposals for collaboration, while industry needs to be better informed at the workface, become
more receptive to testing new ideas, and occasionally move outside its comfort zone.
As Christer (2005) and Dwight (2005) both pointed out, the industrialists and the academics alike felt the gap
was worth closing; yet three years later, nothing appears to have changed to demonstrate that academia and
industry are working closer together.
References
Burgoyne, J. (1995) Feeding minds to grow the business, People Management, 1 (19), 2 – 6
Christer, A. H. (2005) Closing the gap between maintenance theory and practice, Keynote Address at the Int. Congress of Maintenance
Societies, Australia, April 2005
Christer, A. H., Wang, W., Baker, R. and Sharp, J. (1995) Modelling maintenance practice of production plant using the delay time
concept, IMA J. of Mathematics Applied in Business and Industry, 6, 67 – 84
Crosby, P. B. (1979) Quality is free: The art of making quality certain, New York: New American Library
Deming, W. E. (1986) Out of the crisis, Cambridge, MA: Massachusetts Institute of Technology
Dwight, R. (2005) Industrial-academic interface in academic research, Paper presented at the Int. Reliability and Maintenance Conf.,
Toronto, Canada, November 2005
Gray, H. (1999) Universities and the Creation of Wealth, Open University Press, Buckingham, UK
Juran, J. (1989) Juran on leadership for Quality: An Executive handbook, The Free Press, New York, USA
Scarf, P. (2004) Results of EPSRC Sponsored Survey, Paper presented at the IMA Conference on Reliability and Maintenance,
University of Salford, May, 2004
Stochastic modelling maintenance actions of complex systems
J. Rhys Kearney *, David F. Percy, Khairy A. H. Kobbacy
Salford Business School, University of Salford, Greater Manchester, M5 4WT, England
[email protected]
Abstract: When a system is maintained there are many different actions which can be performed to improve its
reliability, such as the replacement of aged components, the reconfiguration of mechanisms or the cleaning and
lubrication of moving parts; it is assumed that each of these actions would then contribute to an overall
improvement in the performance of the system as a whole. Few assumptions are made about the structure of the
system in the hope that effects can be estimated from history data. Presented is a discussion on the form of the
adjustment that should be made to the system rate of occurrence of failure to best capture the physical properties
of the intervention, be it replacement of parts of the system or preventive maintenance with limited efficacy.
1. Introduction
A system will comprise a number of subsystems and components that are interconnected in such a way that the
system is able to perform a set of required functions. As adopted by Rausand & Hoyland (2004), the term
functional block will be used to denote an element of the system, whether it is a component or a large subsystem.
The general definition of reliability in the ISO 8402 standard is ‘the ability of an item to perform a required
function, under given environmental and operational considerations, for a stated period of time’. According to
the IEC50(191) standard, failure is the event when a required function is terminated while a fault is ‘the state of
an item characterized by inability to perform a required function’; a fault is hence the system state resulting
from failure.
This work’s focus is on modelling the reliability of repairable and maintainable technical systems of a
complex nature. A repairable system is one which, upon failure, can be restored to satisfactory performance by
any method other than replacement of the entire system (Ascher & Feingold (1984)). A system which is
repairable will also be considered maintainable in this work. Here the time taken to repair the system on failure,
referred to as corrective maintenance (CM), is considered negligible in comparison with the length of time the
system is in operation and is therefore not considered in the modelling process. The purpose of preventive
maintenance (PM) is to improve system reliability – maintenance actions are carried out on the system with the
intention of retaining or restoring the system to a certain level of functional performance. These could include
the repair, replacement, cleaning, lubrication or reconfiguration of functional blocks.
There have been many proposed methods for modelling the effect of maintenance actions; most of these
models are variations of the simple renewal process and the nonhomogeneous Poisson process. Ascher &
Feingold (1984) claimed that these are the fundamental models for replacements and repairs, respectively.
Discussion follows on these modelling methodologies with regard to factors which need to be taken into
consideration in order to develop an accurate mathematical representation of the physical manifestation of the
type of maintenance action undertaken on complex systems.
2. Maintenance actions and system dependencies
Given its operational history, the reliability of a complex system is an aggregate measure of the reliability of its
functional blocks given the system’s structural dependencies; that is, the system ROCOF depends on
interactions between components, otherwise referred to as stochastic dependence (Nicolai & Dekker (2006)). The basic
assumption is that the state of a functional block, given by its age, failure rate or other measure of condition, can
influence the state of other components in the system. Suppose a functional block of a system is replaced by one
which was regarded as newer and more reliable; one might infer that the reliability of the system as a whole
would improve as a result. This would be the case if the block was functionally independent, but as part of a
complex system the resultant effect on overall reliability can be more difficult to quantify.
Take the simple example of a two-unit system, illustrated in Figure 1: a pump which compresses gas into a
sealed chamber via a system of pipes and valves. Valves 1 and 2 on the diagram are one-way in that the gas can
only travel in the direction indicated; the third valve is also one-way but only releases when a certain pressure in
the chamber is attained. The system is operating satisfactorily if the blasts of gas coming out of Valve 3 occur
within prescribed ranges of pressure, frequency and duration. If any of these criteria are not met, the system is at
fault. Say it is observed that the pressure of the blasts being released by the third valve has fallen below the
acceptable level and the frequency of the blasts has increased. The maintenance engineer makes the decision to
replace Valve 3. This corrects the fault, bringing the system performance back within the acceptable limits, the
previous valve having structurally failed as a result of repeated usage. Dependencies on this replaced block may
affect the reliability of others in the system. For example, the fixed valve, by increasing the work done by the
pump, may alter the pump’s failure mode; the system may fail sooner as the rate of degradation of the pump
increases.
Figure 1. A repairable system of a compressor pump and chamber
To assess with certainty the reliability of a system and the effect of repair and maintenance actions would
require detailed measurement of system parameters and investigation of their dynamic behaviour given
inter-component and subsystem dependencies, which becomes increasingly difficult for ever more complicated
systems. The model proposed in this work instead takes a top-down, holistic view of system reliability and of
maintenance actions, via a system-wide failure intensity function which implicitly models the effect of
dependencies.
3. Multiplicative scaling
The accepted core model of failure occurrence is based upon the nonhomogeneous Poisson process, which has
the capacity to describe nonstationary interfailure times. The proportional intensities model (PIM), introduced by
Cox (1972), is suited to the modelling of repairable systems as the effects of maintenance actions are captured in
adjustments made to the rate of occurrence of failures (ROCOF). Percy & Alkali (2006) proposed a PIM where
the intensity function is multiplicatively scaled upon failure and repair such that
the intensity function is multiplicatively scaled upon failure and repair such that
N (t )
λ (t ) = λ0 (t )∏ si
(1)
i =1
where s_i > 0 are the intensity scaling factors and N(t) is the number of corrective maintenance actions up to
and including time t. The scaling factors s_i can take the form of positive constants, random variables, specified
functions of the number of failures i or of the times t_i at which these occur, or random variables with evolving
means.
In this work, a multiplicative adjustment is considered more conceptually justifiable than an additive one, which
could potentially result in a negative intensity. A premise of multiplicative scaling is that the absolute effect of a
maintenance action is proportional to the health of the system at the time the maintenance action is carried out.
Say a system has a monotonically increasing baseline intensity function, i.e. the system health is degrading with
usage, resulting in an increased frequency of failures, as is the case for the intensity functions illustrated in
Figure 2a). It is reasonable to assume for this common scenario that the benefits of a maintenance action will
increase as system health degrades – there would be little benefit in maintaining a new system in good working
order relative to a system which has suffered atrophy through continued usage. This effect can be most simply
captured by a constant proportional reduction of the intensity function where s_i = ρ. Figure 2a) illustrates a
power-law intensity scaled multiplicatively by a constant ρ = 0.5 to represent the effect of a maintenance action
at either t = 4 or t = 8. The effect is more pronounced when the system ROCOF is larger at t = 8;
however, the resultant system intensity is the same for either maintenance time.
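As a minimal numerical sketch of this constant proportional scaling (equation (1) with s_i = ρ), the following Python fragment uses an assumed power-law baseline λ_0(t) = a b t^(b−1); the parameter values are illustrative only and are not taken from the paper:

import numpy as np

def scaled_intensity(t, maintenance_times, a=1.0, b=3.0, rho=0.5):
    # Power-law baseline lambda0(t) = a*b*t**(b-1), multiplicatively scaled
    # by rho at each maintenance time: lambda(t) = lambda0(t) * rho**N(t),
    # where N(t) counts maintenance actions up to and including time t.
    lam0 = a * b * t ** (b - 1)
    n_t = np.sum(np.asarray(maintenance_times)[:, None] <= t, axis=0)
    return lam0 * rho ** n_t

t = np.linspace(0.01, 10.0, 500)
lam_early = scaled_intensity(t, [4.0])   # maintenance at t = 4, cf. Figure 2a)
lam_late = scaled_intensity(t, [8.0])    # maintenance at t = 8

Plotting lam_early and lam_late against t reproduces the qualitative behaviour of Figure 2a): the later intervention removes more intensity in absolute terms, yet the two scaled curves coincide once both actions have occurred.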
Figure 2a) b) Constant proportional reduction on increasing intensity
Implicit in the assumption of proportional scaling of the intensity function is that the maintenance action
has the equivalent effect of making a proportion of the system perfectly reliable; i.e. if ρ = 1/2, one can state
that, of all those things in the system which were contributing to the rate of occurrence of failure prior to the
maintenance action taking place, 50% have been eradicated by the maintenance action. If we take a holistic view
of the system, this is equivalent to half the system being made perfectly reliable. One difficulty in scaling the
intensity function by a multiplicative constant to represent the effect of a maintenance action is that the part of
the system which, in theory, becomes perfectly reliable remains as such, ceasing to contribute to the system
failure intensity, since the rate of change of the ROCOF is also multiplicatively scaled:

λ′(t) = λ_0′(t) ρ^{N(t)}                    (2)

Figure 2b) illustrates the scenario where the baseline is scaled by a constant ρ = 1/2 at times t = 2, 4, 6 and 8,
the repeated scaling resulting in an exponential reduction in the proportion of the system which contributes to
the system failure intensity.
4. System repair by component replacement
When the maintenance action results in a proportion of the system being replaced, it is supposed that this is
equivalent to a proportion of the system being renewed. The baseline intensity function, λ_0(t), represents the
failure rate of the system in the absence of any maintenance action. It is assumed that this system-wide intensity
function can be scaled to represent the failure intensity of a proportion p of the system; that is, components of
the system are independent and identically distributed. This model is based upon an overhaul model proposed
by Zhang & Jardine (1998) who attribute a proportion p of an intensity function to that of the system in the
previous overhaul period. At time t_i a proportion p ∈ [0,1] of the system’s components is completely renewed
during a maintenance action; the corresponding system intensity function for the period following the
maintenance action, λ_i(t), given the system intensity function prior to the action, λ_{i−1}(t), is given by

λ_i(t) = p λ_0(t − t_i) + (1 − p) λ_{i−1}(t);   t_i < t < t_{i+1}                    (3)

On solving the recurrence relation (taking t_0 = 0), this can be written as

λ_i(t) = Σ_{j=0}^{i} p^{1−δ(i,j)} (1 − p)^j λ_0(t − t_{i−j});   t_i < t < t_{i+1}                    (4)
where

δ(i, j) = { 0 if i ≠ j;  1 if i = j }                    (5)
We illustrate the intensity function corresponding to this model in Figure 3 for a quadratic baseline intensity
function with p = 1/2 and repairs at times t_1 = 100 and t_2 = 200.
Equation (3) proposes a simple way to model the effect of replacement on system reliability. An assumption
of the replacement policy modelled is that all components of the system are equally likely to be chosen for
replacement; that is, no account is taken of the relative health or age of components in the system. Further
investigations allowing for the inclusion of relative component health in the replacement policy will be
presented in future work.
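A minimal sketch of this proportional renewal, evaluated directly through the recurrence of equation (3), follows; p = 1/2 and the repair times match the Figure 3 illustration, while the quadratic baseline coefficient is an assumed value:

import numpy as np

def renewal_intensity(t, repair_times, p=0.5, lam0=lambda u: 1e-4 * u ** 2):
    # Equation (3): after each repair at t_i, a proportion p of the system
    # restarts on the baseline while the rest keeps its previous intensity.
    lam = lam0(t)                       # intensity before the first repair
    for ti in sorted(repair_times):
        lam = np.where(t > ti, p * lam0(t - ti) + (1 - p) * lam, lam)
    return lam

t = np.linspace(0.0, 300.0, 601)
lam = renewal_intensity(t, [100.0, 200.0])   # cf. Figure 3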
Figure 3. Typical effect of replacements on intensity function
The effects of dependencies between the replaced components and system reliability need further
investigation. One hypothesis is that the replaced components will age faster, due to the increasing rate of
failures induced by the older components of the system, tending toward the systemic age. The functional age of
the newer components could be moderated to include such an effect by rewriting equation (3) as

λ_i(t) = p λ_0( t − t_i exp{−φ(t − t_i)} ) + (1 − p) λ_{i−1}(t);   t_i < t < t_{i+1}                    (6)

where φ > 0 is an unknown constant. Immediately after the repair, the functional age t − t_i exp{−φ(t − t_i)} of
the renewed proportion is zero, and as t increases it tends toward the systemic age t.
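A hedged extension of the previous sketch to the age-moderation of equation (6); the value of φ here is purely illustrative:

import numpy as np

def moderated_renewal_intensity(t, repair_times, p=0.5, phi=0.01,
                                lam0=lambda u: 1e-4 * u ** 2):
    # As equation (6): the renewed proportion's functional age
    # t - t_i*exp(-phi*(t - t_i)) starts at zero at t = t_i and tends
    # to the systemic age t, so the new parts "catch up" with the old.
    lam = lam0(t)
    for ti in sorted(repair_times):
        age = t - ti * np.exp(-phi * (t - ti))
        lam = np.where(t > ti, p * lam0(age) + (1 - p) * lam, lam)
    return lam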
5. Preventive maintenance
Percy & Alkali (2006) identified limitations with some of the proposed models for preventive maintenance,
including difficult interpretation and poor definition. They concluded that a generalization of the proportional
intensities model offers a flexible stochastic process that adapts readily to a variety of applications where only
event history data are available. As touched upon in Section 3, constant scaling of the system intensity function
implies that the beneficial effect of a maintenance action is permanent, in that it results in an equivalent
proportion of the system becoming perfectly reliable. We suppose this is not the best way to model the effect of
the preventive maintenance actions considered here, taken to be those maintenance tasks, other than the
replacement of components, which restore the system to a level of functional performance and which are known
to have a limited efficacy given continued system usage.
The need for modelling the decay of maintenance effects was noted by Jack & Murthy (2002), who
proposed investigating the idea for additive adjustments of the intensity function. Unless an additive reduction is
a function of time (here taken as the measurement of system usage), the adjustment has no effect on the rate of
change of the system ROCOF. This means that an additive reduction implies that the effect of a maintenance
action is an instantaneous shift in the absolute frequency of system failure while the rate at which the failure rate
changes remains unaffected – the system is made younger in absolute terms but continues to age at the same
rate, merely from a lower frequency of failure. In a virtual age model, a maintenance action is supposed to
restore the intensity function of the system to that of an earlier time; the assumption here is that the overall effect
of a maintenance action is to make the system globally younger in all aspects of failure behaviour. Neither of
these behaviours seems intuitively sensible. Instead, we propose that the effect of a maintenance action should be
proportional to the health of the system, i.e. multiplicative scaling, but that the maintenance action should have
limited efficacy, such that the effect decays with time.
The first decay factor that we consider is exponential in nature and involves defining the scaling factor in
equation (1) to be of the form

s_i = 1 − (1 − ρ) exp{−φ(t − t_i)}                    (7)
in terms of two unknown parameters, ρ > 0 and φ > 0. Note that s_i = ρ immediately after preventive
maintenance, corresponding to the constant proportional intensities model, but we now have the realistic
scenario that the effect of preventive maintenance vanishes over time, because s_i → 1 as t → ∞.
Although this decay factor is satisfactory, we can improve upon it by imposing instantaneous proportional
intensities scaling upon performing preventive maintenance. To do this, the gradient of the intensity function
must instantaneously scale by the same factor ρ as affects the actual function value. We achieve this by
modifying equation (7) so that the scaling factors take the time-squared form

s_i = 1 − (1 − ρ) exp{−φ(t − t_i)²}                    (8)
again in terms of two unknown parameters, ρ > 0 and φ > 0. The effect then degrades in the shape of an S-curve,
such that initially the rate at which the effect degrades is slight but then increases before easing off as few
benefits of the maintenance remain, similar to the logarithmic adoption curve. Figure 4 displays a graph of this
intensity function for a quadratically increasing baseline intensity function with ρ = 1/2, φ = 1/2,000 and preventive
maintenance at times t_1 = 100 and t_2 = 200.
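A corresponding sketch of the decaying scale factors of equations (7) and (8), applied to a quadratic baseline, is given below; ρ, φ and the maintenance times echo the Figure 4 illustration, while the baseline coefficient is an assumption:

import numpy as np

def pm_intensity(t, pm_times, rho=0.5, phi=1 / 2000, squared=True,
                 lam0=lambda u: 1e-4 * u ** 2):
    # Each PM at t_i contributes a factor
    # s_i(t) = 1 - (1 - rho)*exp(-phi*(t - t_i)**k), with k = 2 for the
    # S-curve of equation (8) or k = 1 for equation (7); s_i -> 1 as the
    # time since maintenance grows, so the benefit decays away.
    k = 2 if squared else 1
    lam = lam0(t)
    for ti in pm_times:
        si = 1 - (1 - rho) * np.exp(-phi * (t - ti) ** k)
        lam = np.where(t > ti, lam * si, lam)   # scale only after the PM
    return lam

t = np.linspace(0.0, 300.0, 601)
lam = pm_intensity(t, [100.0, 200.0])   # cf. Figure 4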
Figure 4. Typical effect of preventive maintenance on intensity function
The diminishing effects of preventive maintenance interventions are clear from this graph, the system
intensity tending back to the baseline intensity as the time since the action increases. The baseline can then be
regarded as the rate at which failures would occur if no interventions were made on the system, with
maintenance actions acting as a shock shift from this baseline which, through system entropy, decays, returning
the system to its un-maintained ROCOF.
6. Conclusion
One measure of the immediate effect of a maintenance action on the system ROCOF introduced in Section 3 is
the equivalent proportion of the system which is made perfectly reliable, if all parts of the system are considered
independent and identically distributed. It was noted that constant multiplicative scaling is not a sensible
assumption, because mathematically this is equivalent to that proportion of the system remaining perfectly
reliable for the duration of the system’s operation. To counter this unrealistic behaviour, a proportional renewal
model was proposed in Section 4 for component replacement and a decay parameterisation in Section 5 for
preventive maintenance, to better approximate the effect of maintenance actions.
References
Ascher, H. and Feingold, H. (1984) Repairable systems reliability: modeling, inference, misconceptions and their causes, New York,
Dekker, M.
Cox, D. R. (1972) The statistical analysis of dependencies in point processes, Stochastic Point Processes
Jack, N. and Murthy, D. N. P. (2002) A new preventive maintenance strategy for items sold under warranty, IMA J. of Management
Mathematics, 13, 121-129
Nicolai, R. P. and Dekker, R. (2006) Optimal Maintenance of Multi-Component Systems: A Review
Percy, D. F. and Alkali, B. M. (2006) Generalized proportional intensities models for repairable systems, IMA J. of Management
Mathematics, 17, 171-185
Rausand, M. and Hoyland, A. (2004) System reliability theory: models, statistical methods, and applications, Hoboken, NJ, Wiley-Interscience
Zhang, F. and Jardine, A. K. S. (1998) Optimal maintenance models with minimal repair, periodic overhaul and complete renewal, IIE
Transactions, 30, 1109-1119
Application of the delay-time concept in a manufacturing industry
B. Jones*, I. Jenkinson, J. Wang
Liverpool John Moores University, Liverpool, UK.
[email protected]
Abstract: This paper presents a methodology for applying delay-time analysis to a maintenance and inspection
department. The aim is to reduce the downtime of plant items and/or to reduce maintenance and inspection
costs. A case study of a company producing carbon black is included to demonstrate the proposed methodology.
Keywords: Maintenance, inspection maintenance, delay-time analysis
1. Introduction to delay-time analysis concept
Delay-Time Analysis (DTA) is a concept whereby the delay-time h, the period between the initial telltale sign of
a failure arising at time u and the actual failure, can be modelled in order to establish a maintenance strategy.
The delay-time is the window in which inspection or maintenance could be carried out in order to avoid total
failure. Figure 1 illustrates the delay-time concept (Christer & Waller (1984)).
Figure 1. The delay-time concept
2. Methodology
In order to develop a maintenance model using delay-time analysis, a methodology is needed to give the process
a framework. Delay-time analysis can be used as a tool for reducing the downtime per unit time, D(T) (Christer
et al. (1995)), of a machine or a piece of equipment, based on an inspection period T and the probability b(T) of
a defect arising within this time frame leading to a breakdown. For a particular plant item, component or series
of machines, delay-time analysis is useful because the equipment in question is generally high-volume and of
high capital expense; therefore any reduction in downtime due to breakdown or over-inspection can be
beneficial. As with the modelling of downtime per unit time, it is also possible to establish a cost model, C(T)
(Leung & Kit-leung (1996)), again based on an inspection period T and probability b(T); this model estimates
the expected cost per unit time of maintenance. This modelling has also been used for safety criticality (Pillay &
Wang (2003)) on a fishing vessel, giving the safety criticality of a failure and the operational safety criticality.
A methodology for applying delay-time analysis is proposed as follows:
• Understand the process.
• Identify the problems.
• Establish data required.
• Gather data.
• Establish parameters.
• Validation of the delay-times and the distribution.
• Establish assumptions.
• Establish a downtime model D(T) and cost model C(T).
When the probability distribution function of the delay-time, f(h), follows an exponential distribution, i.e. when
the failure rate λ (or 1/MTBF) is constant over a specified time period, the distribution function shown in
equation (1) is used in calculating the probability b(T) of a defect leading to a breakdown:

f(h) = λe^(−λh)                    (1)

The probability b(T) of a defect leading to a breakdown failure can be expressed as in equation (2):

b(T) = ∫₀ᵀ ((T − h)/T) f(h) dh                    (2)

Inserting the distribution function f(h) into the breakdown failure probability b(T) gives

b(T) = ∫₀ᵀ ((T − h)/T) λe^(−λh) dh                    (3)

This term can be further simplified as

b(T) = (1/T) ∫₀ᵀ (T − h) λe^(−λh) dh                    (4)
It is important to note that b(T) is independent of the arrival rate of defects per unit time, kf, but is dependent
on the delay-time h.
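Since f(h) is exponential, the integral in equation (4) evaluates in closed form to b(T) = 1 − (1 − e^(−λT))/(λT). A minimal sketch of this calculation, taking λ from the case study’s MTBF of three years expressed in days, is:

import math

def b_of_T(T, lam):
    # Probability a defect leads to a breakdown before the next inspection,
    # i.e. the closed form of (1/T) * integral_0^T (T - h)*lam*exp(-lam*h) dh.
    return 1.0 - (1.0 - math.exp(-lam * T)) / (lam * T)

lam = 1.0 / (3 * 365)      # failure rate per day for an MTBF of 3 years
print(b_of_T(14.0, lam))   # small for inspection periods much shorter than the MTBF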
2.1 Downtime model D(T)
It has been demonstrated (Leung & Kit-leung (1996), Pillay et al. (2001)) that, having established a probability
of breakdown failure b(T), it is also possible to establish an expected downtime per unit time function D(T), as
shown in equation (5):

D(T) = (d + kf T b(T) db) / (T + d)                    (5)

where
d = downtime due to inspection,
kf = arrival rate of defects per unit time,
b(T) = probability of a defect leading to a breakdown,
db = average downtime for a breakdown repair,
T = inspection period.
Substituting b(T) from equation (4) into equation (5) gives

D(T) = ( d + kf T [ (1/T) ∫₀ᵀ (T − h) λe^(−λh) dh ] db ) / (T + d)                    (6)
2.2 Cost model C(T)
Similarly, given the inspection cost Costi, the breakdown cost CostB and the inspection repair cost CostIR,
the expected cost per unit time of maintaining the equipment with an inspection period T is C(T), giving

C(T) = ( kf T { CostB b(T) + CostIR [1 − b(T)] } + Costi ) / (T + d)                    (7)

where
C(T) = expected cost per unit time of maintaining the equipment on an inspection schedule of period T,
CostB = breakdown repair cost,
CostIR = inspection repair cost,
Costi = inspection cost.
The cost of an inspection is shown in equation (8):

Costi = (Costip + Costd) Tinsp                    (8)

where
Costip = cost of inspection personnel per hour,
Costd = cost of downtime per hour,
Tinsp = time taken to inspect.
The cost of a breakdown is calculated as the cost of the failure plus the costs of corrective action to bring the
equipment back to a working condition. The details of a breakdown repair are shown in equation (9).
CostB = (Mstaff + Costd)(Tinsp + Trepair) + Sp + Se                    (9)

where
Mstaff = maintenance staff cost per hour,
Trepair = time taken to repair,
Sp = spares and replacement parts cost,
Se = special equipment / personnel / hire costs.
The cost of an inspection repair is similar to the breakdown repair cost, apart from the following:
• An inspection repair will not generally incur special equipment hire costs (Se).
• The time to repair will be of shorter duration for an inspection repair.
The shorter duration of an inspection repair is mainly because a breakdown has a greater knock-on effect. The
equation for the inspection repair cost is shown in equation (10):

CostIR = (Mstaff + Costd)(Tinsp + Trepair) + Sp                    (10)
A point to note regarding the cost model C(T) (equation (7)) is that it describes a worst-case scenario, namely a
fault leading to failure before an inspection takes place or a fault being detected at inspection. Conversely, the
best-case scenario would be no failure taking place before inspection and no fault being present at inspection.
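To show how the two models are used together, the sketch below evaluates D(T) and C(T) over candidate inspection periods and picks the minimisers, reusing the closed form for b(T) given earlier. The parameter names mirror the paper’s notation and the values are those of the case study in Section 3; note that the optima obtained depend on how those parameters are interpreted (units, per-item versus per-plant rates), so the sketch illustrates the method rather than guaranteeing the paper’s exact figures:

import math

def b_of_T(T, lam):
    # Closed form of equation (4) for exponential delay-times.
    return 1.0 - (1.0 - math.exp(-lam * T)) / (lam * T)

def D_of_T(T, lam, d, kf, db):
    # Expected downtime per unit time, equation (5).
    return (d + kf * T * b_of_T(T, lam) * db) / (T + d)

def C_of_T(T, lam, d, kf, CostB, CostIR, Costi):
    # Expected cost per unit time, equation (7).
    b = b_of_T(T, lam)
    return (kf * T * (CostB * b + CostIR * (1 - b)) + Costi) / (T + d)

params = dict(lam=1 / (3 * 365), d=0.1, kf=0.28)   # case-study values, in days
Ts = range(1, 42)
T_min_downtime = min(Ts, key=lambda T: D_of_T(T, db=7, **params))
T_min_cost = min(Ts, key=lambda T: C_of_T(T, CostB=350794, CostIR=5000,
                                          Costi=67, **params))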
3. Case study
In order to demonstrate the above models for downtime D(T) and cost C(T) a case study of a factory producing
carbon black in the UK is given.
This particular process of creating carbon black is made up of three units, A, C and D. The three units cover
the whole process stream: the reactor, the MUF (Main Unit Filter), which collects and separates the product from
the gases produced, and the conveying of the carbon black into storage containers. Low-pressure air and natural
gas produce a high-temperature flame (1500 degrees centigrade) in the combustion zone of the reactor. Heavy
oil, known as feedstock, is sprayed into the flame and the carbon black reaction occurs. After the
feedstock is exposed to the high temperature it is quenched with water in order to stop the carbon black
formation reaction. At this point the basic form of carbon black, carbon black powder, is formed. The filter is a
bag-type filter measuring approximately 10 cm in diameter and 2.5 m in length. The cost of a filter is around £28
each, with a life expectancy of three to four years. A second manufacturer’s filter costs around £7.50 but has a
life expectancy of between 12 and 14 months, with a lower tolerance to acid than the more expensive filter.
3.1 Costs of a failure
When a filter bag is to be changed, the compartment has to be closed down. This requires 8 hours of cool-down,
followed by between 6 and 24 hours of downtime for repair and replacement, then a further 2 hours to warm the
unit back up; if a total re-bag is required, downtime is generally around 7 days. When a unit is brought off-line it
continues to burn gases in order to keep the temperature in the reactor constant, thus wasting energy.
Furthermore, energy recovered by the system can be used by the facility and any surplus energy is sold back to
the national grid, so any downtime is costly not just in wasted energy but also in lost income from surplus
energy. Sometimes specialist maintenance crews need to be brought in to deal with the problem. A typical
example of a breakdown which took 7 days to repair and replace all bags is given below.
typical example of a breakdown which took 7 days to repair and replace all bags is demonstrated below.
• Loss of production per hour: £1,500
• Burn of gasses per hour: £238
• Loss of export of energy per hour: £26
• Cost of maintenance personnel per hour: £28
• Cost of supervisor per hour: £36
• Cost of replacement filters (1,435): £40,180
• Jetting crew: £710
• Jetter hire: £300
• Cherry picker hire: £2,500
This gives a total cost for a breakdown resulting in 1,435 filters being replaced, affecting 1 MUF for a period of
7 days, of £350,794.
3.2 Establishing a delay-time analysis
In order to establish a delay-time analysis for this example several parameters need to be known. The
parameters used in this example are as follows:
• The arrival rate of a defect, kf - 0.28 per day.
• Mean time between failure (MTBF) - 3 years.
• Downtime for an inspection, d - 0.1 days.
• Downtime for a breakdown repair, db - 7 days.
• Breakdown repair cost, CostB - £350,794.
• Inspection repair cost, CostIR - £5,000.
• Inspection cost, Costi - £67.
Applying the parameters to equation (6), it is possible to establish an inspection interval for which minimum
downtime is of primary concern, as illustrated in figure 2.
Figure 2. Optimal inspection period based on minimum downtime D(T)
As illustrated in figure 2, the optimal inspection interval based on minimum downtime D(T) is 14 days. When
the cost C(T) is of primary concern, the optimal inspection interval is 11 days with a cost of £940, as shown in
figure 3. If the inspection interval were moved to 14 days, in line with minimum downtime, the cost would rise
to £977, a nominal increase of £37.
Figure 3. Optimal inspection period based on minimum cost C(T)
4. Validation
In order to analyse the effect of changes on the results of D(T) and C(T), a sensitivity analysis was carried out
on each model. The analysis varied certain input data by 5% and 10%, with the following results.
4.1 Validation of D(T)
The optimal inspection interval remains very close to the original interval given an increase and decrease of 5%
and 10%. The sensitivity analysis for D(T) is shown graphically in figure 4.
Figure 4. A graphical representation of the sensitivity analysis based on D(T).
4.2 Validation of C(T)
A sensitivity analysis was carried out on the cost of an inspection repair and the cost of an inspection in order to
analyse the effect of a change in the costs. The cost of an inspection repair and an inspection has been increased
and decreased by 5% and 10%. The sensitivity analysis is shown graphically in figure 5. The optimal inspection
interval remains very close to the original interval given an increase and decrease of 5% and 10%.
Figure 5. A graphical representation of the sensitivity analysis based on C(T).
5. Discussion
This case study has demonstrated that, using the delay-time analysis technique, an optimal inspection interval of
14 days is obtained when minimum downtime D(T) is the criterion. Using minimum cost C(T) as the criterion,
an inspection interval of 11 days with a cost of £940 was calculated.
Current practice at the company is a weekly inspection interval involving a flame check and a cloth check. It
could be argued that this inspection interval should move to a two-week interval, but given the nature of the two
inspection checks and the fact that they do not stop production, a weekly inspection interval appears reasonable.
6. Conclusion
This case study looked at a company in the UK producing carbon black. The paper demonstrates the use of the
delay-time concept for minimising downtime and costs by setting inspection intervals accordingly. Information
was gathered from historical data as well as expert judgement, with parameters established from this
information in order to develop the delay-time models.
Acknowledgements
The authors wish to thank Mr G. Wright and Mr A. Whitehead for their kind help in providing data and other
necessary information.
References
Arthur, N. (2005) Optimization of vibration analysis inspection intervals for an offshore oil and gas water injection pumping system, J. of
Process Mechanical Engineering, 219, Part E, 251-259
Christer, A. H. and Waller, W. M. (1984) Delay-time models of industrial inspection maintenance problems, J. of the Operational
Research Society, 35, 401-406
Christer, A. H., Lee, C. and Wang, W. (2000) A data deficiency based parameter estimating problem and case study in delay-time PM
modelling, International J. of Production Economics, 67, 63-76
Christer, A. H., Wang, W. and Baker, R. D. (1995) Modelling maintenance practice of production plant using the delay time concept,
IMA J. of Mathematics Applied in Business and Industry, 6, 67-83
Christer, A. H., Wang, W., Choi, K. and Sharp, J. (1998a) The delay-time modelling of preventive maintenance of plant given limited
PM data and selective repair at PM, IMA J. of Mathematics Applied in Business and Industry, 15, 355-379
Christer, A. H., Wang, W., Sharp, J. and Baker, R. D. (1998b) A case study of modelling preventative maintenance of a production plant
using subjective data, J. of the Operational Research Society, 49, 210-219
Leung, F. and Kit-leung, M. (1996) Using delay-time analysis to study the maintenance problem of gearboxes, International J. of
Operations and Production Management, 16 (12), 98-105
Pillay, A. and Wang, J. (2003) Technology and safety of marine systems, Elsevier Science Publishers Ltd., Essex, UK, ISBN: 0 08
044148 3, 149-164, 179-199
Pillay, A., Wang, J., Wall, A. D. and Ruxton, T. (2001) A maintenance study of fishing vessel equipment using delay-time analysis, J. of
Quality in Maintenance Engineering, 7 (2), 118-127
International Carbon Black Association (ICBA) (2004) Carbon Black users guide. Safety, health and environmental information
The environmental protection act 1990. Chapter 43
Integrated pollution prevention control (IPPC) European commission (2006) http://ec.europa.eu/environment/ippc
OREDA (2002) In offshore reliability data, 4th edition, SINTEF Industrial Management
A preventive maintenance decision model based on a MCDA approach
Rodrigo José Pires Ferreira, Cristiano Alexandre Virgínio Cavalcante, Adiel Teixeira de
Almeida
Federal University of Pernambuco, Caixa Postal 7462, Recife - PE, 50.630-970, Brazil
[email protected], [email protected], [email protected]
Abstract: Maintenance techniques have an essential role in keeping systems available, especially in the
competitive and expanding environment of the services sector. Failures have several negative implications, and
the cost minimization model, frequently used in a manufacturing context, is not seen as efficient for many
maintenance problems in a service context. In particular, quality of service is related to customer perception, the
monetary value of which is difficult to estimate.
This paper deals with the problem of replacement in service production systems. A maintenance
multicriteria decision aiding model is proposed, based on integrating the PROMETHEE method and the
Bayesian Approach. This allows decision makers to establish replacement intervals. A numerical application
illustrates the decision model proposed and shows the model’s effectiveness regarding the decision maker's
preferences.
Keywords: multicriteria decision, preventive maintenance, maintenance policies
1. Introduction
Maintenance management has a vital role in organizations and has been seen as having increasing
importance over recent years. In the early-to-mid-twentieth century, maintenance was characterized as a
predominantly corrective activity, and it became more complex from the middle of the century as a result of
industrialization. Later, at the end of the century, maintenance became more critical and important because of
automation. The maintenance management area developed into various segments, such as Preventive
Maintenance, Condition Monitoring, Reliability Centered Maintenance (RCM), Total Productive Maintenance
(TPM) and several applications of Operations Research (OR) (Dekker & Scarf (1998)).
Given today's high degree of competitiveness, the efficiency of maintenance management represents an
element of competitive differentiation for companies, both because of the high capital involvement and because
of its direct relationship with the quality of products and services. In the goods sector, inefficient maintenance
management can directly result in loss of production, as well as excessive use of overtime, extra hiring of staff,
higher stocks and other damage.
In the services sector, the inefficiency of maintenance management can be critical and cause irreversible
losses. Some reasons for this are that a service is delivered at the moment it is produced and cannot be stored.
The level of the consequences in this sector can be visualized when lives are put at risk, should equipment
failures occur, for example, during serious medical surgery. Other examples are failures in air transportation
equipment and failures in the distribution of electricity, which cause several direct and indirect consequences of
great impact. These consequences are difficult to quantify in monetary units, but they must be taken into
consideration and evaluated adequately in each context. Moreover, a service not delivered in a satisfactory way
because of the non-availability of equipment can generate customer dissatisfaction. At the same time, customers
will change their perception of the given service and will probably stop requesting services from this company in
future.
Preventive maintenance consists of actions that attempt to prevent the occurrence of failures by substituting
parts of the system in advance; in the terminology used in this paper, therefore, reference is made to the plan to
substitute equipment or parts that will fail in operation unless a substitution is made in time. In this context,
preventive maintenance is appropriate for equipment whose failure rate increases with use (Glasser (1969),
Barlow & Proschan (1965) and Barlow & Proschan (1975)).
According to Lam (2006), preventive repair is usually adopted to improve a system’s reliability and to
operate it more economically. In many cases, such as in a hospital or a steel manufacturing complex, a cut
in electricity supply may cause a serious catastrophe. Preventive repair is a very powerful measure, since it will
extend the system’s lifetime and raise reliability at a lower cost rate.
This paper presents a proposal for a decision model in the preventive maintenance area, the objective of
which is to determine intervals of preventive maintenance that give a higher return from decision-making with
regard to both system reliability and the expected cost of the maintenance policy.
Thus, analysing reliability and cost simultaneously allows a policy of preventive maintenance to be
established under the multicriteria decision aiding methodology, based on the PROMETHEE method
proposed by Brans & Vincke (1985) combined with Bayesian reliability analysis (Martz & Waller (1982)). This proposal
allows decision maker preferences to be dealt with in an appropriate way. The Bayesian approach is used in
applying the model in order to overcome the absence of failure data.
Many authors have described different models for Bayesian analysis applied to preventive maintenance.
Brint (2000) considers a model that allows the interval between preventive maintenance actions to be extended
for assets whose failures can be catastrophic. Makis & Jardine (1992) consider a model for optimal replacement,
the purpose of which is to specify simple rules for substitution in order to minimize, in the long run, the expected
average cost per unit of time. Mauer & Ott (1995) observe the importance and the effect of uncertainties on
costs and thus follow a similar line to the previous authors; they present a model whose objective is to obtain
the optimal time between substitutions while taking uncertainty about the cost into account. Percy &
Kobbacy (1997, 2000) consider stochastic models when there is little data available, emphasizing the
intervention of preventive maintenance in order to prevent system failure. Silver & Fiechter (1995) deal with the
problem of periodic maintenance using a Bayesian analysis approach combined with heuristic procedures.
Within the context of programmed maintenance, there are works that use the multicriteria decision aiding
approach. Quan (2007) models a multi-objective problem and uses evolutionary algorithms to solve it,
introducing a form of utility theory to find Pareto optimal solutions. Kralj & Petrovic (1995) use a
multi-objective approach. Gopalaswamy et al. (1993) propose a multi-criteria decision model, where three
criteria are considered: the rate of cost of minimum substitution, the maximum availability and the reliability of
the base component. Almeida (2004) presents a model based on multi-attribute utility theory. Lotfi (1995)
presents a model based on multiple objective mixed linear programming. Chareonsuk et al. (1997) use the
multi-criteria methodology PROMETHEE to establish the interval between maintenance actions taking into
account two important criteria: cost by unit of time and reliability.
Section 2 of the paper describes the problem and its features. Section 3 sets out hypotheses concerning
preventive maintenance and decision maker preferences on which the decision model is built. Section 4 presents
a numerical application illustrating the results of the proposed methodology, and Section 5 provides a discussion
of the results.
2. The problem
In this section, the principal characteristics of the problem of preventive maintenance are presented, including
the basic structure and the context of the problem.
Initially, it is considered that preventive maintenance can be applied to a piece of equipment, to a
component or to a system. The policy of replacement by age is a procedure that consists of replacing an asset
with a reserve one at the moment it fails or when it reaches an age T (the replacement age). The reserve asset is
subject to the same rules as the asset being replaced. This policy is only effective if replacement before failure
costs less than replacement upon failure, and thus provides some savings when compared to replacement due to
failures.
The main issue in this replacement policy is to determine the age at which an asset should be replaced at
the lowest cost per unit time of use. In many cases, cost represents the most important, or the only, aspect of
decision maker preferences. This situation is frequently seen in the goods production sector. However, in the
services sector, the decision maker can show a preference for minimizing undesirable consequences which, due
to their complexity, are difficult to measure in financial units. According to Almeida (2004), in this context, the
customer is in direct contact with the production system. The output is produced while the customer is being
served. Therefore, the product received by the customer is affected by problems due to failures in the production
system. As a result, losses due to failures, or interruptions for preventive maintenance, cannot simply be counted
in monetary form. In the future, the consequences of interruptions to the service can affect the customer’s
willingness to contract with that supplier, or lead to the cancellation of the current contract. In this case, the
consequences of failures cannot be transformed into costs. The objectives in such a system endeavour to reduce
costs as part of a mix with other objectives, such as availability, reliability of the production system, the time
during which the system is interrupted, and quality of the service.
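As background to the two criteria developed below, the classical age-replacement policy can be summarised by the standard renewal-reward cost rate together with the reliability at the replacement age. The sketch below uses this textbook formulation (cf. Barlow & Proschan (1965)) with a Weibull lifetime, whose shape parameter beta > 1 gives the increasing failure rate assumed above; it is not necessarily the authors’ exact Cm(T) and R(T), and all parameter values are illustrative assumptions:

import numpy as np

def reliability(t, beta=2.5, eta=1000.0):
    # Weibull survival function R(t) = exp(-(t/eta)**beta); beta > 1 gives
    # an increasing failure rate, the setting where age replacement pays off.
    return np.exp(-((np.asarray(t, dtype=float) / eta) ** beta))

def cost_rate(T, cp=1.0, cf=10.0, beta=2.5, eta=1000.0):
    # Classical age-replacement expected cost per unit time:
    # [cp*R(T) + cf*(1 - R(T))] / integral_0^T R(t) dt, where cp is the
    # planned replacement cost and cf > cp the cost of a failure replacement.
    t = np.linspace(0.0, T, 2001)
    mean_cycle = np.trapz(reliability(t, beta, eta), t)
    R_T = float(reliability(T, beta, eta))
    return (cp * R_T + cf * (1.0 - R_T)) / mean_cycle

for T in (300, 500, 700, 900):            # candidate replacement ages
    print(T, cost_rate(T), float(reliability(T)))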
3. The decision model
The objective of the decision model is to determine the frequency of preventive maintenance so as to make the
best decision with regard to decision maker preferences, given the possible difficulties in estimating the
probability distribution function of failures. A Bayesian reliability approach has frequently been applied in such
cases (Martz & Waller (1982)).
In the multicriteria approach, the set of alternatives of this problem is represented by the times, T, at
which the preventive maintenance activity may be carried out. In other words, the decision maker wants to
evaluate the time alternatives and determine the one that is best adapted to his preferences, yielding a
recommendation as to the best time for preventive maintenance to take place. The determination of the timing of
replacement provides different features both for the reliability and for the structure of maintenance costs.
Accordingly, the criteria considered in the decision model are the expected cost of maintenance, Cm(T), and
reliability, R(T), assuming that the decision maker wishes to take these two criteria into consideration
simultaneously, instead of considering only the expected cost of maintenance, Cm(T).
The decision model should take into account the decision maker’s subjective aspects for making the
decision, because, for each context, the criterion Cm(T) can require different levels of importance. For instance,
if the consequence of the failure of a piece of hospital equipment is associated with deaths, the relative
importance given to criterion R(T) may well be higher than in other contexts.
In the literature, there are few applications of the multicriteria decision support methodology in the area of
programmed maintenance. This paper presents a multicriteria approach which considers the treatment of
uncertainties related to maintenance data. There is a diversity of texts that tackle the problem of equipment
replacement, under very different focuses and considering several aspects. However, a common feature of such
works is the use of the optimum paradigm, where only one objective function is considered.
The hypotheses of the model are:
- The set of alternatives is discrete; in other words, there is a finite number of alternatives regarding
replacement times;
- The equipment is subject to wear; in other words, the equipment presents an increasing failure rate;
- The replacement of a piece of equipment or part gives the system a good-as-new performance;
- The times of failure of the equipment can be modeled using a probability distribution;
- The parameters of the distribution can be elicited from specialist knowledge.
The main factors analyzed in the choice of a multicriteria decision support method are: the problem
analyzed, the context considered, the structure of the decision maker's preferences and the problematic. In this
problem, the structure of the decision maker's preferences is assumed to be noncompensatory because of the
difficulty of establishing a trade-off between the two criteria. Therefore, the outranking concept should be used
instead of the additive aggregation of methods based on a single synthesis function.
The context of the problem justifies the reason for choosing the method. The problem is framed within the
problematic of choice defined by Roy (1996). Fast use, easy interpretation by the decision maker and a flexible
comparison process were fundamental factors in choosing the method.
As a result of the features presented above, PROMETHEE was chosen as the multicriteria decision support
method. The PROMETHEE method (Preference Ranking Organization METHod for Enrichment Evaluations)
consists of building and exploiting an outranking relation (Vincke (1992) and Brans & Mareschal (2002)). The
methods of the PROMETHEE family are used in multicriteria problems of the type:
$\max \{ f_1(x), f_2(x), \ldots, f_k(x) \mid x \in A \}$    (1)

where A is a denumerable finite set of n potential actions and $f_j(\cdot)$, j = 1, 2, ..., k, are the k criteria, mappings
from A into the set of real numbers. Each criterion has its own units, and there is no restriction when certain
criteria are to be maximized and others minimized.
This method allows decision maker preferences to be taken into consideration for each attribute and
modeled by a different function, called a generalized criterion. The performance of each alternative is evaluated
on each criterion, and the alternatives are then compared pairwise by applying the generalized-criterion
functions to the differences in their performances. Using the procedures established for the method, a score is
calculated for each alternative, and the best alternative is the one that attains the highest score.
The multicriteria methodology adopted to deal with the two criteria Cm(T) and R(T) simultaneously was
combined with a Bayesian analysis methodology in order to support situations where failure data are absent.
An interesting aspect of this problem is the conflict between the criteria. Reliability and maintenance cost
are in conflict from the time origin up to the minimum point of the function Cm(T), denoted Cm*(T). However,
for time alternatives beyond Cm*(T), an increase in the function Cm(T) is accompanied by a decrease in the
function R(T), which is a characteristic of dominance. Thus, the alternatives beyond this point are dominated by
the time alternatives prior to it, and can therefore be ignored. These aspects are shown in Figure 1.
Figure 1. Reliability and Cost of Maintenance
The estimate of the failure distribution should be obtained from historical data, value judgements or a
combination of both. Although the probability distribution of failures can be obtained directly from the data, it
is often difficult to associate the data with the information necessary for planning replacement, and, in addition,
the data sample may not be sufficient to obtain a reasonable estimate of the probability of failure. An alternative
approach for obtaining a model that describes the failure behavior of equipment over time is to assume a failure
distribution and then estimate its parameters. This approach is justified by the viability of using an appropriate
mathematical model. The Weibull distribution was selected to model the distribution of the times between
failures, since it is useful in a variety of applications, particularly for modelling the life of devices. Besides
being commonly used to model equipment failures, the Weibull distribution is also flexible and can be used for
several types of data (Nelson (1982) and Weibull (1951)). The density function of the Weibull distribution is:
$f(x) = \frac{\beta}{\eta} \left( \frac{x}{\eta} \right)^{\beta - 1} \exp\left[ -\left( \frac{x}{\eta} \right)^{\beta} \right]$    (2)
When data are absent and consequently there is uncertainty in both parameters of this distribution,
specialist knowledge can be used, through the concept of subjective probability. The parameters of the Weibull
distribution are treated as random variables θ1 and θ2, whose distributions π(θ1) and π(θ2) should be elicited
from specialist knowledge of these variables (Raiffa (1970)).
In order to deal with the uncertainty on the parameters η and β, the computation of the reliability criterion,
R(t), considers the distributions π(η) and π(β), which are themselves assumed to follow Weibull distributions.
This criterion is denoted Z1(T) and is calculated for each time alternative according to the following equation:
$Z_1(T) = E[R(t)] = 1 - \int_{-\infty}^{t} \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} \pi(\beta)\,\pi(\eta)\, f(x)\, d\beta\, d\eta\, dx$    (3)
The second criterion, Z2(T), is represented by the cost of maintenance, Cm(T), in accordance with the
following expression:

$Z_2(T) = C_m(t) = \dfrac{c_a \,(1 - E[R(t)]) + c_b \, E[R(t)]}{\int_{-\infty}^{t} \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} x\,\pi(\beta)\,\pi(\eta)\, f(x)\, d\beta\, d\eta\, dx + t\, E[R(t)]}$    (4)
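As an illustration, criteria (3) and (4) lend themselves to straightforward Monte Carlo evaluation. The following Python sketch is ours, not the authors': it assumes, as in Table 1 below, that the priors π(β) and π(η) are themselves Weibull distributions, and it estimates the denominator of (4) as E[min(X, T)], the expected cycle length. The function name and sample sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def criteria(T, ca=1000.0, cb=200.0, n_draws=200000):
    """Monte Carlo estimate of Z1(T) = E[R(T)] and Z2(T) = Cm(T) under
    Weibull priors pi(beta) and pi(eta) (prior parameters as in Table 1)."""
    # Draw the uncertain Weibull parameters from their Weibull priors.
    beta = 3.15 * rng.weibull(5.40, n_draws)    # pi(beta): shape 5.40, scale 3.15
    eta = 6000.0 * rng.weibull(1.80, n_draws)   # pi(eta):  shape 1.80, scale 6000
    R = np.exp(-(T / eta) ** beta)              # survival R(T) = exp(-(T/eta)^beta)
    ER = R.mean()                               # Z1(T), equation (3)
    x = eta * rng.weibull(beta)                 # one failure time per parameter draw
    cycle = np.minimum(x, T).mean()             # E[min(X, T)], denominator of (4)
    return ER, (ca * (1.0 - ER) + cb * ER) / cycle

for T in (500, 1200, 2400):
    z1, z2 = criteria(T)
    print(f"T = {T:4d}: Z1 = {z1:.4f}, Z2 = {z2:.4f}")
```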
4. Numerical application
This section presents a numerical application in order to illustrate the model presented in the previous section. A
hypothetical example is generated based on values close to the reality of a company. The application is carried
out for a given piece of equipment with a view to planning preventive maintenance for it. The replacement
policy by age is suggested in accordance with the features of the equipment. The objective of the model is to
determine the most appropriate time between replacements. In addition, information on periods of failure and
the costs before failure (cb) and after failure (ca) are necessary for the application of this policy. These values are
presented in Table 1.
ca = 1000      cb = 200

Prior    β      η
π(β)     5.40   3.15
π(η)     1.80   6000

Table 1. Parameters for the decision model
The time alternatives were generated by considering the interval between 500 and 3000 days, with an
interval of 100 days between alternatives. After generating the alternatives, their performances were calculated
for the two criteria, as shown in Table 2.
The alternative of 2400 days is optimal for criterion Z2(T); the alternatives with times greater than
2400 days are therefore ignored, as they are dominated.
T      Z1(T)      Z2(T)         T      Z1(T)      Z2(T)
500    0.98008    0.87698       1800   0.78588    0.47435
600    0.97010    0.76334       1900   0.76823    0.47141
700    0.95862    0.68669       2000   0.75051    0.46926
800    0.94599    0.63242       2100   0.73279    0.46771
900    0.93244    0.59265       2200   0.71507    0.46672
1000   0.91809    0.56283       2300   0.69741    0.46613
1100   0.90306    0.54007       2400   0.67985    0.46590
1200   0.88745    0.52250       2500   0.66239    0.46598
1300   0.87135    0.50884       2600   0.64509    0.46628
1400   0.85482    0.49817       2700   0.62795    0.46679
1500   0.83794    0.48982       2800   0.61101    0.46747
1600   0.82079    0.48330       2900   0.59427    0.46829
1700   0.80341    0.47825       3000   0.57779    0.46917
Table 2. Performances of alternatives
In this context, through an interactive process between the decision maker and the decision analyst, the
generalized criterion function (Fj(.)) is determined in order to model decision maker behavior vis-à-vis the
extent of the differences (dj(.)) between the evaluations on each criterion (fj(.)). In this way, the indifference and
preference thresholds are estimated. Table 3 shows the type of generalized criterion function and its respective
parameters.
Criterion   Objective   Weight   Preference Function   Indifference Threshold   Preference Threshold
Z1(T)       Maximize    0.42     Linear                0.030                    0.100
Z2(T)       Minimize    0.58     Linear                0.015                    0.200
Table 3. Parameters for the decision model
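For illustration, a minimal sketch of the PROMETHEE II computation with type-V (linear) generalized criteria follows, using the weights and thresholds of Table 3. It is run here on a subset of the alternatives only, so the flows will not reproduce Table 4 exactly; the function names are ours, not from the paper.

```python
import numpy as np

def linear_pref(d, q, p):
    """Type-V generalised criterion: 0 below q, linear ramp, 1 above p."""
    return float(np.clip((d - q) / (p - q), 0.0, 1.0))

def promethee2(perf, weights, q, p, maximise):
    """Net flows (phi) for PROMETHEE II; perf has one row per alternative."""
    n, k = perf.shape
    phi = np.zeros(n)
    for a in range(n):
        for b in range(n):
            if a == b:
                continue
            for j in range(k):
                d = perf[a, j] - perf[b, j]
                if not maximise[j]:
                    d = -d                      # flip sign for minimised criteria
                # preference of a over b adds to phi[a]; of b over a subtracts
                phi[a] += weights[j] * (linear_pref(d, q[j], p[j])
                                        - linear_pref(-d, q[j], p[j])) / (n - 1)
    return phi

# Illustrative run on five of the Table 2 alternatives (parameters: Table 3).
T = [900, 1000, 1100, 1200, 1300]
perf = np.array([[0.93244, 0.59265], [0.91809, 0.56283], [0.90306, 0.54007],
                 [0.88745, 0.52250], [0.87135, 0.50884]])
phi = promethee2(perf, [0.42, 0.58], [0.030, 0.015], [0.100, 0.200], [True, False])
print(sorted(zip(T, phi.round(4)), key=lambda x: -x[1]))
```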
The PROMETHEE II method is used, allowing a complete pre-order among the evaluated alternatives to
be established through the difference between the positive and negative flows. As a result, the application of the
multicriteria decision support method generates a ranking of the alternatives from best to worst; see Table 4.
According to PROMETHEE II, 1200 days is the model's solution. A sensitivity analysis for the criteria weights
was carried out: the weight of criterion Z1(T) can vary in the interval between 34.33% and 56.88% without
affecting the model's solution.
T      Φ+       Φ−       Φ           T      Φ+       Φ−       Φ
1200   0.3054   0.1214   0.1841      800    0.3237   0.3182   0.0055
1100   0.3136   0.1317   0.1820      1900   0.2072   0.2381   -0.0309
1300   0.2960   0.1256   0.1704      2000   0.1915   0.2608   -0.0693
1000   0.3204   0.1635   0.1569      2100   0.1808   0.2833   -0.1026
1400   0.2854   0.1352   0.1502      700    0.3247   0.4441   -0.1193
1500   0.2731   0.1499   0.1231      2200   0.1749   0.3056   -0.1307
900    0.3223   0.2224   0.0998      2300   0.1742   0.3277   -0.1536
1600   0.2586   0.1687   0.0898      2400   0.1745   0.3496   -0.1751
1700   0.2427   0.1919   0.0508      600    0.3188   0.5129   -0.1940
1800   0.2254   0.2152   0.0102      500    0.3169   0.5641   -0.2473
Table 4. Results
5. Conclusions
This work has presented the application of a multicriteria decision model to preventive maintenance planning.
The model deals with the periodicity of replacement for a given item based on more than one criterion, and in
the absence of failure data, in order to provide appropriate support for the decision maker in determining the
most opportune time to replace an item.
In addition, the bibliographical research showed that preventive maintenance, its applications, replacement
policies and the problems arising from them deserve further investigation; important works could be observed,
especially ones which tackle equipment replacement under very different focuses.
In conclusion, the proposed model addresses one of the great concerns of maintenance functions in
organizations that have large sums tied up in the fixed assets of their production plants.
Acknowledgements
This work is part of a research study funded by the Brazilian Research Council (CNPq).
References
Almeida, A. T. de and Cavalcante, C. A. V. (2004) Multicriteria decision approaches for selection of preventive maintenance intervals,
MIMAR 2004 - 5th IMA International Conference on Industrial Maintenance and Reliability, Salford
Barlow, R. E. and Proschan, F. (1965) Mathematical Theory of Reliability, John Wiley & Sons
Barlow, R. E. and Proschan, F. (1975) Reliability and Life Testing Probability Models, Holt, Rinehart and Winston
Brans, J. P. and Vincke, P. (1985) A preference ranking organisation method: (The PROMETHEE method for multiple criteria decision-making), Management Science, 31 (6), 647-656
Brans, J. P. and Mareschal, B. (2002) Promethee-Gaia, une Méthodologie d'Aide à la Décision en Présence de Critères Multiples,
Editions Ellipses, Bruxelles
Brint, A. T. (2000) Sequential inspection sampling to avoid failure critical items being in an at risk condition, J. of the Operational
Research Society, 51 (9), 1051-1059
Chareonsuk, C., Nagarur, N. and Tabucanon, M. T. (1997) A multicriteria approach to the selection of preventive maintenance
intervals, International J. of Production Economics, 49 (1), 55-64
Dekker, R. and Scarf, P. A. (1998) On the impact of optimisation models in maintenance decision making: the state of the art, Reliability
Engineering and System Safety, 60, 111-119
Glasser, G.J. (1969) Planned replacement: some theory and its application, J. of Quality Technology, 1 (2), 110-119
Gopalaswamy, V., Rice, J. A. and Miller, F. G. (1993) Transit vehicle component maintenance policy via multiple criteria decision
making methods, J. of the Operational Research Society, 44 (1), 37-50
Kralj, B. and Petrovic, R. (1995) A multiobjective optimization approach to thermal generating units maintenance scheduling, European
J. of Operational Research, 84 (2), 481-493
Lam, Y. (2006) A geometric process maintenance model with preventive repair, European J. of Operational Research. In Press
Lotfi, V. (1995) Implementing flexible automation: a multiple criteria decision making approach, International J. of Production
Economics, 38 (2-3), 255-268
Martz, H. F. and Waller, R. A. (1982) Bayesian Reliability Analysis, John Wiley & Sons, New York
Makis, V. and Jardine, A. K. S. (1992) Optimal replacement in the Proportional Hazards Model, INFOR, 30 (2), 172
Mauer, D. C. and Ott, S. H. (1995) Investment under uncertainty: the case of replacement investment decisions, J. of Financial and
Quantitative Analysis, 30 (4), 581-605
Nelson, W. (1982) Applied Life Data Analysis, Wiley & Sons
Percy, D. F. and Kobbacy, K. A. H. (2000) Determining economical maintenance intervals, International J. of Production Economics, 67
(1), 87-94
Percy, D. F., Kobbacy, K. A. H. and Fawzi, B. B. (1997) Setting preventive maintenance schedules when data are sparse, International J.
of Production Economics, 51 (3), 223-234
Quan, G., Greenwood, G. W., Liu, D. and Hu, S. (2007) Searching for multiobjective preventive maintenance schedules: combining
preferences with evolutionary algorithms, European J. of Operational Research, 177, 1969–1984
Raiffa, H. (1970) Decision Analysis, Addison-Wesley
Roy, B. (1996) Multicriteria Methodology Goes Decision Aiding, Kluwer Academic Publishers.
Silver, E. A. and Fiechter, C. N. (1995) Preventive maintenance with limited historical data, European J. of Operational Research, 82
(1), 125-144
Weibull, W. (1951) A statistical distribution function of wide applicability, J. of Applied Mechanics, 18, 293-297
Predicting the performance of future generations of complex repairable systems,
through analysis of reliability and maintenance data
T. J. Jefferis*(1), N. Montgomery(2), T. Dowd(3)
1. Defence Science and Technology Laboratory, Farnborough, UK
2. Department of Mechanical and Industrial Engineering, University of Toronto, Toronto, ON, Canada
3. Department of Aerospace Engineering, University of Bristol, Bristol, UK
[email protected]
Abstract: Can the engineering performance of future generations of aircraft be predicted from current/historic
data? In this paper data on the operation of sixteen types of Royal Air Force military aircraft gathered between
1983 and 2002 are analysed. The usage, failure rates and maintenance effort are considered for individual types,
and comparisons are made between types with nominally similar operational profiles and between aircraft of
different ages. Aircraft size and production cost are investigated as potential predictors of maintenance burden,
with encouraging results.
1. Introduction
For many years the Royal Air Force (RAF) has centrally collected data on the usage, reliability and associated
corrective maintenance burden of the fixed wing and rotary wing aircraft that it operates. From 1970 this was
achieved through a system named (rather unoriginally) Maintenance Data System (MDS). In common with most
systems of that age, MDS required manual completion of record cards for each relevant fault and the related
maintenance actions, which were forwarded to a central processing cell for input. The capture of usage data
followed a similar process. MDS was very well designed and has proved to be a useful and flexible system for
over 30 years, but advances in both hardware and software mean that a similar system could now be
implemented more flexibly and effectively using different techniques. At present the capture of platform usage
and maintenance data is being transferred to a new system, the Logistics Information Technology System
(LITS). This switch-over began in 2003 and various changes in the scope of the faults which are reported mean
that comparisons between MDS and LITS data are problematic. Due to this difficulty and also because of the
potential for data to have been lost during the switch between the two systems it was decided to limit this study
to the data held on MDS.
The data stored in MDS and LITS is widely used by the project teams responsible for the management of
individual aircraft types to examine fleet usage issues, to prioritise options for improving aircraft availability
and for other management tasks. However, as far as is known, the subject data has never been used to address
the broader issues raised in this paper. Recent publications in the area of fleet-wide reliability and
maintainability have tended to focus on new reliability metrics such as the maintenance and failure free
operating periods. See Kumar (1999) and Hockley & Appleton (1997), for example. These studies focus on the
design requirements of new aircraft. We have found no published examples of using historical data to examine
overall trends in reliability.
The data used in this study was for the calendar years 1983 – 2002 (inclusive) and covers sixteen types of
fixed wing aircraft of various vintages (see Table 1). The first six of these aircraft types (Tornado IDS to
Canberra) can be classified as ‘Fast Jets’ and represent the RAF’s front line aircraft.
Aircraft Type      First Flight of Production   Year of Entry to   Production Cost/Unit,      Max Take Off
                   Standard Aircraft            RAF Service        relative to Tornado IDS    Weight (kg)
Tornado IDS(1)     1977                         1982               1.00                       28000
Tornado ADV(2)     1984                         1987               1.29                       27896
Harrier(3) II(4)   1987                         1987               0.76                       14061
Jaguar             1972                         1973               0.50                       15700
Phantom            1960                         1968               0.71                       28030
Canberra           1950                         1951               0.36                       24198
Hawk               1975                         1976               0.26                       9100
Tucano             1983                         1988               0.07                       3275
Jetstream          1968                         1973               0.27                       6950
Dominie(5)         1963                         1965               0.33                       12700
Hercules K(6)      1964                         1967               0.82                       70300
Nimrod(7)          1967                         1969               2.31                       87090
Sentry             1975                         1991               n/k                        156150
Tristar            1971                         1984               1.75                       225000
VC-10              1964                         1981               3.15                       151900

Notes: (1) Interdictor Strike, GR1 and GR4. (2) Air Defence Variant, F2 and F3. (3) Based upon HS Harrier, first production flight 1969. (4) GR5/GR7/GR9 standard. (5) Version of HS.125 Business Jet; current production version named Raytheon Hawker. (6) Based upon Lockheed Hercules A, first production flight 1956. (7) Based upon the de Havilland Comet 4, first flight 1958.

Table 1. Aircraft Types, Relevant Dates, Cost and Weight
The next four (Hawk to Dominie) can be classified as 'Trainers' and represent the aircraft used by the RAF(8) as
part of their training processes (generally of aircrew). The final five types (Hercules K to VC-10) can be classed
as 'Large Aircraft', which include Tankers, Transport Aircraft and Patrol Aircraft.
2. Data gathering
As MDS contains detailed data on every unscheduled maintenance action undertaken on the subject aircraft, a
high degree of resolution can be achieved, where the situation warrants it, limited only by the integrity of the
underlying data. However, as this particular study was looking for long-term trends the data for each aircraft
type was grouped into six-month periods, 1 January – 30 June and 1 July – 31 December, for each calendar year.
Within these periods the following relevant data elements were extracted:
• Aircraft type
• Total fleet flying hours
• Arisings(9) per flying hour
• 1st, 2nd, 3rd/4th line(10) and total maintenance man-hours per flying hour(11)
More detailed data has also been extracted for the period 1997 – 2005, which shows the contribution of
different sub-systems to the Arising, No Fault Found and Adverse Operational Effect rates. It is hoped that, in
due course, this will be utilised to further investigate the causes of differences in maintenance burden between
aircraft types.
As discussed below, during analysis of the MDS data it was found desirable to identify potential surrogate
measures of system complexity, ideally measured against the current ‘state of the art’ when the aircraft was
designed. Two relatively objective measures were identified: aircraft production cost and the aircraft maximum
take-off weight, adjusted to constant economic conditions and constant production numbers. All of the data
considered in this study is of some sensitivity. Therefore both the MDS data and the cost data have been scaled
to conceal the absolute values, whilst maintaining the same relative relationships.
3. Initial data review and analysis
The two measures of maintenance burden contained in the data extracted from MDS are the number of arisings
and the effort that is required to fix them. The relevant values for fast jets are shown below in Figures 1 and 2.
Consideration of these two measures shows that their values must be connected, as the overall maintenance
effort (MMH/FH) is defined as the number of arisings, multiplied by the average time to repair. It is therefore
no surprise to see that there are similarities between the two figures. In each case the Tornado IDS, Tornado
ADV and Phantom are the three highest values, with the Harrier II, Jaguar and Canberra being the lowest three.
One can also observe that, if the initial entry-into-service upturns and the transfer-to-LITS downturns are
excluded, there are no common, time-related trends. This might initially appear surprising, as conventional
wisdom is that the maintenance burden of every aircraft increases with time; however, the long-term trend will
actually reflect factors such as the relative effectiveness of maintainer 'learning' and of the system
improvements undertaken in an attempt to improve Reliability and Maintainability, as well as the effects of
ageing.
(8) Data also includes Royal Navy Jetstreams. (9) Arisings are events that require unscheduled maintenance actions (i.e. faults). (10) During the data collection period RAF maintenance was divided into four 'lines'; 1st and 2nd Line were undertaken at the Operating Bases, with 3rd and 4th Line being at Depot or in Industry. (11) MMH/FH.
[Figure: six-monthly time series, Jan 1983 – Jan 2001, for Tornado IDS, Tornado ADV, Harrier II, Jaguar, Phantom and Canberra]
Figure 1. Fast Jet Arisings per Flying Hour (Actual values normalised to Tornado IDS Average = 1)

[Figure: six-monthly time series, Jan 1983 – Jan 2001, for the same six types]
Figure 2. Fast Jet Maintenance Man Hours per Flying Hour (Actual values normalised to Tornado IDS Average = 1)
A similar lack of a common trend is seen in the graphs for the fleets of Training Aircraft and Large Aircraft,
shown in Figures 3 and 4. Investigation shows no relationship between the age of the design (i.e. when the
aircraft type first flew) and the maintenance burden for any of the classes of aircraft. However, many of the
aircraft fleets show non-zero gradients in the trend of maintenance burden over time, and these are shown in
Table 2. It is possible that differences in the histories of these fleets (e.g. implementation of modification
programmes) may explain some or all of these differences, but this must be left for future investigation. At
present we may simply observe that most aircraft fleets exhibit some significant time-related trend in the
maintenance effort required to support them, and that this trend can be increasing or decreasing.
[Figure: six-monthly time series, Jan 1983 – Jan 2001, for Hawk, Tucano, RAF Jetstream, All Jetstream and Dominie]
Figure 3. Training Aircraft Maintenance Man Hours per Flying Hour (Actual values normalised to Tornado IDS Average = 1)
[Figure: six-monthly time series, Jan 1983 – Jan 2001, for Hercules K, Nimrod, Sentry, Tristar and VC10]
Figure 4. Large Aircraft Maintenance Man Hours per Flying Hour (Actual values normalised to Tornado IDS Average = 1)
Aircraft      Linear Correlation   p-value      Aircraft        Linear Correlation   p-value
Tornado IDS   -0.455               0.003        RAF J'stream     0.017               0.915
Tornado ADV    0.365               0.026        All Jetstream    0.607               0.000
Harrier II     0.824               0.000        Dominie         -0.330               0.037
Jaguar        -0.330               0.037        Hercules K      -0.560               0.000
Phantom       -0.233               0.368        Nimrod          -0.833               0.000
Canberra       0.203               0.291        Sentry           0.182               0.443
Hawk          -0.802               0.000        Tristar          0.439               0.009
Tucano        -0.456               0.011        VC10             0.601               0.000
Table 2. Time trends of Maintenance Man Hours per Flying Hour
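The paper does not state the exact test behind Table 2; assuming the "linear correlation" is a Pearson correlation of each fleet's six-monthly MMH/FH series against the period index, a sketch of the computation might look as follows (the series shown is synthetic, not RAF data).

```python
import numpy as np
from scipy import stats

def time_trend(series):
    """Pearson correlation of a six-monthly MMH/FH series against period index."""
    t = np.arange(len(series))
    return stats.pearsonr(t, series)   # returns (linear correlation, p-value)

# Synthetic example: a fleet whose burden drifts downward over 40 half-years.
rng = np.random.default_rng(1)
series = 1.0 - 0.005 * np.arange(40) + rng.normal(0.0, 0.05, 40)
r, p = time_trend(series)
print(f"linear correlation = {r:.3f}, p-value = {p:.3f}")
```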
4. Analysis of aircraft size and complexity
In the absence of sufficient data to explain the time-related trends in maintenance burden the average value of
Maintenance Man Hours per Flying Hour is now used as a surrogate for each aircraft type’s maintenance burden,
so that the relationship between this and aircraft size and complexity, as represented by Maximum Take-Off
Weight and Production Cost, respectively, can be investigated.
As might be expected, it is found that the Large Aircraft, which are mainly modified airliners, do not follow
the same relationships as the smaller military aircraft. Figure 5 therefore only includes data on Fast Jets and
Trainers. A linear regression on these data gives a good fit, with R² = 0.8002, for the equation:
Normalised Average MMH/FH = 3.459 × 10⁻⁵ × MTOW − 0.0628
The relationship between Cost and MMH/FH for the same aircraft gives similar results, but in this case the
relationship is slightly less good, with R2=0.746. As might be expected, there is a correlation between Aircraft
Weight and Cost, which is especially strong for the smaller aircraft. Because of this relationship the joint
regression, including both Weight and Cost provides only a fractionally better predictive relationship (R2=0.825)
than Weight alone. However, as prediction of the costs of future aircraft is difficult and contentious, the
relationship which only requires weight is preferred.
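As a worked illustration of the reported weight-only regression (restated here, not re-derived), the following sketch evaluates it for two take-off weights; the Typhoon MTOW of 23,500 kg is the figure quoted in the conclusions below.

```python
def predicted_mmh_fh(mtow_kg: float) -> float:
    """Reported regression: normalised average MMH/FH from MTOW (kg)."""
    return 3.459e-5 * mtow_kg - 0.0628

typhoon = predicted_mmh_fh(23500)      # MTOW figure quoted in the conclusions
tornado_adv = predicted_mmh_fh(27896)  # MTOW from Table 1
print(f"Typhoon: {typhoon:.3f}, Tornado ADV: {tornado_adv:.3f}, "
      f"ratio: {typhoon / tornado_adv:.2f}")
```

The resulting ratio of roughly 0.83 is consistent with the "about 82%" figure quoted in the conclusions.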
[Figure: scatter of normalised average MMH/FH (0–1.2) against Max Take Off Weight (0–30000 kg) with fitted regression line]
Figure 5. Max Take Off Weight vs Av MMH per Flying Hour
5. Conclusions
Based upon the data obtained from the RAF's Maintenance Data System it is clear that long-term trends do exist
in the maintenance burden associated with maintaining the different aircraft fleets. However, it has been found
that, perhaps contrary to expectations, some of these trends are positive and some negative. The engineering
history of the aircraft fleets will be further investigated to see whether these differences can be explained in
terms of modification programmes, maintenance learning, etc.
With the relatively high-level data available it has been possible to derive relationships to predict the
maintenance burden of future Fast Jets and Trainer aircraft. On the basis of the best of these it is predicted that
the Typhoon, with its Max Take Off Weight of 23,500 kg, should require about 82% of the maintenance that the
Tornado ADV, which it replaces, requires. Similarly, it is predicted that the STOVL JSF will require about 33%
more maintenance than the Harrier, which it replaces. Unfortunately, the variability in the data makes the
prediction intervals rather wide. Naturally these predictions do not take account of any significant changes to
construction techniques, such as greater use of composites, or to maintenance practices, such as the
incorporation of Prognostics, all of which could drastically change the maintenance burden.
References
Hockley, C. J. and Appleton, D. P. (1997) Setting the requirements for the Royal Air Force's next generation aircraft, Reliability and
Maintainability Symposium, 1997 Proceedings, Annual
Kumar, U. D. (1999) New trends in aircraft reliability and maintenance measures, J. of Quality in Maintenance Engineering, 5, 287-295
Crown Copyright 2007. Published with the permission of the Defence Science and Technology Laboratory on behalf
of the Controller of HMSO.
A note on a calibration method of control parameters in simulation models for
software reliability assessment
Mitsuhiro Kimura
Department of Industrial & Systems Engineering, Faculty of Engineering, Hosei University, 3-7-2, Koganei-shi, 184-8584, Tokyo, Japan
[email protected]
Abstract: In this study, we try to develop a calibration method of control parameters included in a Monte Carlo
simulation model which generates time series data for software reliability assessment. The control parameters
are calibrated so as to minimize the mean sum of squared errors between the actual observed data and the
generated ones. As a result, the proposed method enables us to use the simulation model for assessing the
software reliability based on the observed data. We show several numerical examples of the calibration and
software reliability assessment by the simulation models.
1. Introduction
In the literature of software reliability assessment modelling (see, e.g. Pham (2000)) based on software
reliability data, we can find many stochastic process models. Needless to say, such stochastic models can give
us several useful reliability assessment measures such as software reliability, MTBF, and so on (Lyu (1995)).
However, these models also carry several strong assumptions in general. For instance, although
nonhomogeneous Poisson process (NHPP) models are widely known as software reliability assessment models,
they cannot deal with the simultaneous removal of multiple faults, which naturally occurs in a real software
testing phase. In order to overcome this unnatural assumption of NHPP models (Musa (1998)), compound
Poisson process models, for example, have been proposed (Sahinoglu (1992)). However, it is known that the
estimation of the unknown parameters of compound Poisson process models is difficult.
Let us explain this in a more general sense. We assume that there is a time series data set (t_i, d_i)
(i = 1, 2, 3, ..., I) observed from a non-deterministic system: d_i is an outcome of the system when the system
time is t_i. Usually it is natural that d_i is treated as a realization of some stochastic process at the given time
point t_i. Therefore, if we can find or develop an appropriate stochastic process model for the data set, we may
be able to estimate the unknown parameters included in the model and forecast the future behaviour of
(t_i, d_i) (i = I+1, I+2, ...) with several stochastic properties. However, in some cases, since stochastic models
need several unrealistic and/or strong assumptions in order to remain mathematically analyzable, such models
might not describe the real phenomena, or may have difficulties in terms of parameter estimation.
In contrast, Monte Carlo simulation models have more flexibility. They can straightforwardly represent
relatively complicated phenomena which are too tough for stochastic modelling in general. However, simulation
models run in one direction, with the control parameters included in the models fixed in advance, and we are
only able to obtain the execution results of the simulation. That is, although we can investigate the sensitivity of
the model with respect to the control parameters, we usually do not estimate or calibrate the control parameters
inversely from the actual observed data.
In this study, we propose a method of calibrating the control parameters of Monte Carlo simulation models
based on the least squares rule. The models appear in software reliability assessment modelling; in particular,
we consider models that simulate the time-behaviour of software failure occurrences during the software testing
phase.
2. Method of calibration
Let us consider a Monte Carlo simulation program f (called a simulator in this study, for short) which outputs a
time series data set at the given time points t_i (i = 1, 2, 3, ..., I). Thus we represent f by f(t_i | θ), where
θ = {θ1, θ2, θ3, ..., θm} is a vector consisting of the m control parameters included in the simulator. Since the
mechanism of the simulator has some stochastic property, each execution yields one result vector f as a
realization of the simulation:

$\mathbf{f} = ( f(t_1 \mid \theta), f(t_2 \mid \theta), f(t_3 \mid \theta), \ldots, f(t_I \mid \theta) )$    (1)
If we execute this simulation J times with a fixed θ, we consequently have a set of result vectors

$\{ \mathbf{f}_1, \mathbf{f}_2, \mathbf{f}_3, \ldots, \mathbf{f}_J \}$    (2)
One of our main purposes is to calibrate the parameter vector θ so that the simulator emulates well the
observation (t_i, d_i) (i = 1, 2, 3, ..., I) defined in Section 1.
Let SE_j(θ) be the sum of squared errors for the j-th simulation result f_j, that is

$SE_j(\theta) = \sum_{i=1}^{I} \left( f(t_i \mid \theta) - d_i \right)^2$    (3)
where j = 1, 2, 3, ..., J. Since f_j is random, SE_j(θ) also fluctuates. Therefore, we take the sample mean
MSE(θ):

$MSE(\theta) = \frac{1}{J} \sum_{j=1}^{J} SE_j(\theta)$    (4)
Consequently, we formulate the calibration problem of θ as

$\tilde{\theta} = \arg\min_{\theta} MSE(\theta)$    (5)

where $\tilde{\theta}$ denotes the vector of calibrated parameters obtained. However, since MSE(θ) is not differentiable with
respect to θ, finding $\tilde{\theta}$ is not a very easy task.
3. Numerical examples
In this section we show several numerical examples. We define the following:
• S_k (k = 1, 2, 3, ..., K) is the realization of the k-th software fault detection time
• K is the number of detected software faults
Table 1 and Figure 1 present a sample data set to be analyzed (denoted by DS-1), which is cited from
Goel & Okumoto (1979).
Fault no.   S_k (days)      Fault no.   S_k (days)
 1            9             14            87
 2           21             15            91
 3           32             16            92
 4           36             17            95
 5           43             18            98
 6           45             19           104
 7           50             20           105
 8           58             21           116
 9           63             22           149
10           70             23           156
11           71             24           247
12           77             25           249
13           78             26           250
Table 1. Software fault detection time data (DS-1)

[Figure: step plot of the detection times S_k (days, 0–250) against fault number k (1–26)]
Figure 1. Plot of the data set (DS-1)
3.1 One-parameter simulator for DS-1
We assume a simple black-box test (Pham (2000)) for the testing process from which DS-1 was observed. A
simulation algorithm of the black-box test is presented as follows.
Step 1: Prepare a one-dimensional array variable p[1…psize] which represents a program to be tested, where
psize is a positive integer number arbitrarily given, and it is related to the program size of the software system.
Step 2: Set z_i = 0 for i = 1, 2, 3, ..., S_K (i.e., I = S_K).
Step 3: For l = 1 to l = psize, set

$p[l] = \begin{cases} 1 & \text{with probability } \theta_1 \\ 0 & \text{with probability } 1 - \theta_1 \end{cases}$    (6)

where '1' represents a code including a software fault and '0' a clean one, and θ1 denotes the unreliability per
code.
Step 4: Set i = 1.
Step 5: Choose an integer number c randomly from the range [1…psize]. If p[c] = 1 then z_i = z_i + 1 and set
p[c] = 0.
Step 6: If t_i < S_K then i = i + 1 and go to Step 5; else go to Step 7.
Step 7: Return f_1, i.e., f(t_i | θ) = z_i (i = 1, 2, 3, ..., S_K).
As a result of the above, we have simulated fault occurrence data (t_i, f(t_i | θ1)) (i = 1, 2, 3, ..., S_K).
Figure 3 illustrates a sample path of f_1, where we arbitrarily set psize = 500 and θ1 = 0.15.
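A direct transcription of Steps 1–7 into code might look as follows; this is our sketch, not the author's program. We read z_i as the cumulative number of faults detected by time t_i (with t_i = i), which matches the cumulative data d_i that the simulator output is compared against.

```python
import numpy as np

def one_param_simulator(theta1, S_K, psize=500, rng=None):
    """Steps 1-7 of the black-box test simulator: returns z_1..z_{S_K}."""
    rng = rng or np.random.default_rng()
    p = (rng.random(psize) < theta1).astype(int)   # Step 3: seed faults into p[]
    z = np.zeros(S_K, dtype=int)
    found = 0
    for i in range(S_K):                           # Steps 4-6: one probe per t_i
        c = rng.integers(psize)                    # Step 5: pick a code at random
        if p[c] == 1:                              # fault hit: count and remove it
            found += 1
            p[c] = 0
        z[i] = found                               # cumulative faults by t_i = i+1
    return z                                       # Step 7: f(t_i | theta) = z_i

path = one_param_simulator(theta1=0.15, S_K=250)   # the Figure 3 settings
```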
3.1.1 Transformation of DS-1
In order to calculate equation (3), we need to transform DS-1 into the appropriate form (t_i, d_i) by following
the procedure listed below.
Step 1: S_0 = 0, j = 1, and k = 1.
Step 2: If k satisfies S_{j-1} ≤ k < S_j then t_k = k, d_k = j − 1, k = k + 1, go to Step 2; else go to Step 3.
Step 3: j = j + 1. If j < S_K then go to Step 2, else go to Step 4.
Step 4: t_{S_K} = S_K, d_{S_K} = K.
We present the result of the transformation of DS-1 in Figure 4.
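A compact vectorised equivalent of Steps 1–4 (our sketch, assuming the S_k are strictly increasing, as they are in DS-1):

```python
import numpy as np

def transform(S):
    """Steps 1-4: detection times S_1..S_K -> (t_i, d_i), i = 1..S_K."""
    S = np.asarray(S)
    t = np.arange(1, S[-1] + 1)
    d = np.searchsorted(S, t, side="right")   # d_i = #{k : S_k <= t_i}
    return t, d

# DS-1 from Table 1:
S = [9, 21, 32, 36, 43, 45, 50, 58, 63, 70, 71, 77, 78, 87, 91, 92, 95,
     98, 104, 105, 116, 149, 156, 247, 249, 250]
t, d = transform(S)   # d[-1] == 26, as Step 4 requires
```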
3.1.2 Calibration
Now we are ready to calibrate the parameter of f. By letting J = 100, we can obtain MSE(θ1) in equation (4)
for a given θ1. We have therefore calculated MSE(θ1) while varying the value of θ1. The results are plotted in
Figure 5, together with a fitted quadratic function. In order to obtain $\tilde{\theta}_1$ in equation (5), we have simply used
the fitted quadratic curve. Thus we have $\tilde{\theta}_1$ = 0.14465 and MSE($\tilde{\theta}$) = 5485.04. Using this result, Figure 6
depicts the mean behaviour of the calibrated simulator f over 100 iterations with $\tilde{\theta}_1$, together with 95%
confidence intervals for the mean, calculated under the assumption of a normal distribution for the variation
around the mean value.
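A sketch of the calibration loop, reusing one_param_simulator and the (t, d) transformation from the sketches above; the grid range is illustrative, while the quadratic fit mirrors the procedure described.

```python
import numpy as np

def mse(theta1, t, d, J=100):
    """Equation (4): mean over J runs of the sum of squared errors (3)."""
    return np.mean([np.sum((one_param_simulator(theta1, len(t)) - d) ** 2)
                    for _ in range(J)])

# Scan theta1 on a grid, then take the vertex of a fitted quadratic,
# mirroring the paper's use of a fitted quadratic curve.
grid = np.linspace(0.05, 0.30, 26)
m = [mse(th, t, d) for th in grid]
a, b, c = np.polyfit(grid, m, 2)
theta1_cal = -b / (2.0 * a)        # calibrated theta1 (paper reports 0.14465)
```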
3.2 Two-parameter simulator as imperfect debugging model
In this section, we extend the one-parameter simulator discussed in Section 3.1 by adding an imperfect
debugging factor. There are several software reliability assessment models which take imperfect debugging
phenomena into account (Pham (2000)). However, it is known that for almost all imperfect debugging models,
estimating the model parameters is difficult.
Figure 3. Sample path of the one-parameter simulator
Figure 4. Transformed data of DS-1
On the other hand, we can easily extend the one-parameter simulator by replacing Step 5 with Step 5' as follows.
Step 5': Choose an integer number c randomly from the range [1…psize]. If p[c] = 1 then z_i = z_i + 1, and set
p[c] = 0 with probability 1 − θ2; otherwise (with probability θ2, 0 ≤ θ2 ≤ 1) leave p[c] unchanged and go to Step 6.
In this case, we consider two control parameters θ1 and θ2, where θ1 means the unreliability per program code
and θ2 denotes the imperfect debugging rate per debugging activity, respectively. The evaluation result of
MSE(θ) is illustrated in Figure 7. In order to find ($\tilde{\theta}_1$, $\tilde{\theta}_2$), we simply search for the minimum value of
MSE(θ) without fitting any curved surface to the plot. The calibration results are ($\tilde{\theta}_1$, $\tilde{\theta}_2$) = (0.162, 0.20),
and MSE($\tilde{\theta}$) = 3960.57. The fitted simulator and the actual data, with 95% confidence intervals, are plotted
in Figure 8.
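Read as code, and taking Step 5' to mean that a fault that is hit is counted but removed only with probability 1 − θ2 (one plausible reading of the wording), the modification to the earlier sketch is small:

```python
import numpy as np

def two_param_simulator(theta1, theta2, S_K, psize=500, rng=None):
    """One-parameter simulator with Step 5' (imperfect debugging)."""
    rng = rng or np.random.default_rng()
    p = (rng.random(psize) < theta1).astype(int)
    z = np.zeros(S_K, dtype=int)
    found = 0
    for i in range(S_K):
        c = rng.integers(psize)            # Step 5': probe a code at random
        if p[c] == 1:
            found += 1                     # the failure is observed either way
            if rng.random() < 1.0 - theta2:
                p[c] = 0                   # fix succeeds with prob. 1 - theta2
        z[i] = found
    return z
```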
As a measure of goodness-of-fit, we calculate a (pseudo) AIC (Akaike (1974)) for these two models:
AIC (one-parameter simulator) = 774.1
AIC (two-parameter simulator) = 690.7
The AIC indicates that the two-parameter simulator is better than the one-parameter model for DS-1.
3.2.1 Software reliability assessment measures
We propose a quantitative measure for software reliability assessment:

$\frac{\text{Number of faults after debugging}}{\text{Number of faults before debugging}} = \frac{\sum_{l=1}^{psize} p[l] \ \text{at} \ t_{S_K}}{\sum_{l=1}^{psize} p[l] \ \text{at} \ t_{S_0}}$    (7)
We call this ratio the residual fault ratio and denote it by FR_j, where j = 1, 2, 3, ..., J and J represents the
number of iterations of the simulation. By using FR_j, we can estimate the number of latent faults in the
software before the testing phase, LF_j, given by

$LF_j = \frac{K}{1 - FR_j}$    (8)

where K represents the total number of actually detected software faults given by the data set. More directly,
we are able to assess the fault removal ratio R_j by

$R_j = 1 - FR_j$    (9)

Figures 9 and 10 illustrate the histograms of LF_j and R_j with J = 100, respectively. Their mean values are
E[LF] = 80.95 and E[R] = 0.3291. The estimated fault removal ratio is quite low for DS-1.
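Putting equations (7)–(9) together with the calibrated two-parameter simulator gives the following sketch (ours, not the author's code); K = 26 is the number of faults in DS-1. With ($\tilde{\theta}_1$, $\tilde{\theta}_2$) = (0.162, 0.20) it yields values of the same order as the E[LF] and E[R] reported above.

```python
import numpy as np

def assessment(theta1, theta2, S_K=250, K=26, psize=500, J=100, seed=2):
    """Estimate E[LF] and E[R] via equations (7)-(9) by re-running the
    calibrated simulator and tracking the residual faults in p[]."""
    rng = np.random.default_rng(seed)
    LF, R = [], []
    for _ in range(J):
        p = (rng.random(psize) < theta1).astype(int)
        before = p.sum()                       # faults present at t_{S_0}
        for _ in range(S_K):
            c = rng.integers(psize)
            if p[c] == 1 and rng.random() < 1.0 - theta2:
                p[c] = 0                       # Step 5': imperfect removal
        FR = p.sum() / before                  # residual fault ratio, eq. (7)
        R.append(1.0 - FR)                     # fault removal ratio, eq. (9)
        LF.append(K / (1.0 - FR))              # latent faults, eq. (8)
    return np.mean(LF), np.mean(R)

print(assessment(0.162, 0.20))   # calibrated values from Section 3.2
```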
Figure 5. Behaviour of MSE and fitted quadratic curve
Figure 6. Calibrated one-parameter simulator
[Figure: surface plot of MSE over θ1 × 1000 and θ2 × 100]
Figure 7. Behaviour of MSE of two-parameter model

[Figure: actual data d_i and fitted simulator with ±1.96 SD bands against t_i]
Figure 8. Calibrated two-parameter simulator
4. Concluding remarks
This article has proposed a calibration method for the control parameters included in Monte Carlo simulation
models, based on the least squares rule. We have applied this method to simple simulation models for software
reliability assessment. The proposed method can find the optimal calibrated simulator in the heuristic way
discussed in this study; however, we need to develop an effective method for finding $\tilde{\theta}$ when the number of
control parameters becomes large.
In the future, we also need to investigate the size effect of these simulation models. That is, in Section 3 we
set the size of the array variable, psize = 500, arbitrarily; this is not necessarily the optimal size for accurate
reliability assessment by the calibrated simulator. In this sense, the estimate of the fault removal ratio R_j in the
previous section might be defective.
Acknowledgment
This work was partially supported by the Japan Society for the Promotion of Science, Grant-in-Aid for
Scientific Research (C), 18500066, 2006-2007.
References
Akaike, H. (1974) A new look at the statistical model identification, IEEE Trans. Automatic Control, AC-19, 716–723
Goel, A. L. and Okumoto, K. (1979) Time-dependent error-detection rate model for software reliability and other performance measures,
IEEE Trans. Reliability, R-28, 206–211
Lyu, M. (1995) Handbook of software reliability engineering, IEEE Computer Society Press, Los Alamitos
Musa, J. D. (1998) Software Reliability Engineering, McGraw-Hill, New York
Pham, H. (2000) Software Reliability, Springer-Verlag, Singapore
Sahinoglu, M. (1992) Compound Poisson software reliability model, IEEE Trans. Software Engineering, 18, 624–630
[Figure: frequency histogram of LF_j, roughly 60–120]
Figure 9. Histogram of LF_j (J = 100)

[Figure: frequency histogram of R_j, roughly 0.25–0.45]
Figure 10. Histogram of R_j (J = 100)
Estimating the availability of a reverse osmosis plant
Mohammed Hajeeh *, Fatma Faramarzi, Ghadeer Al-Essa
Kuwait Institute for Scientific Research, P.O. Box 24885, Safat-13109, Kuwait
[email protected]
Abstract: This paper presents an assessment of a reverse osmosis (RO) plant in Kuwait by analyzing its operational
and downtime patterns. The plant is divided into main subsystems and the performance of each subsystem is
derived. The overall performance of the RO plant was assessed from the performance of its subsystems. Assessment
of the operational time of the plant was considered more appropriate than other performance measures, since the
plant was designed to operate continuously. Subjective assessment of the failure probabilities of subsystems was
made wherever detailed data were not available. The overall unavailability of the RO plant with and without is
around 1.87 days/year and 0.9 days/year, respectively.
Keywords: Performance measures, operational time, availability, failure probabilities
1. Introduction
Fresh water is essential for life and living species. Many countries have abundant fresh water supplies, while
others have limited resources. The problem of the scarcity of fresh water supplies is apparent in the Gulf
Cooperation Council (GCC) countries where fresh water resources are below poverty levels. In these countries,
the fresh water demand has increased from 4.25 billion cubic meters (bm3) in 1980 to 29.3 bm3 in 2000
(Ebrahim & Abdel-Jawad (1994)). Therefore, desalination technologies have been used extensively in these
countries to produce fresh water to cover the progressive increase in demand. The GCC region accounts for
around 45 percent of total desalination capacity in the world, Parekh (1988). Commercially available
desalination techniques are categorized into two types, i.e., distillation and membrane–based technologies. The
distillation processes transform water into vapor then condense it into a liquid state. This process requires power
in the form of thermal and electrical energy. Commercially available desalination techniques include multistage
flash (MSF), multi-effect desalination (MED), and vapor compression (VC). Membrane–based desalination
techniques consume power in the form of mechanical or electrical energy. Two processes under this category
are commonly used, i.e., reverse osmosis (RO), and electrodialysis (ED). However, the latter is mainly for
brackish water desalination. Although several desalination technologies are used in the GCC, MSF is dominant
and it accounts for approximately 80 per cent of the world’s plants.
RO has been considered a successful process for desalination of brackish water and seawater, Ebrahim &
Abdel-Jawad (1994), Parekh (1988). The first major breakthrough in commercial application of RO came in
1975 when Dow Chemical, Du Pont and Fluid Systems developed large-scale RO modules for the Office of
Water Research and Technology, USA. A considerable amount of interest in, and research on, the RO process
throughout the world has been in evidence since that time. Today RO is considered to be a powerful process for
the removal of various dissolved solids, thereby generating the ultrahigh-purity water needed for the
pharmaceutical industry, research laboratories, haemodialysis, etc. RO has also assumed a prominent role in
freshwater production, because of its unique ability to remove ionic impurities, colloids, organic,
microorganisms, and pyrogenic materials.
The most important subsystems of the RO plant are semipermeable membranes, filters, high-pressure
pumps, feed-water pre-treatment system, and product-water post-treatment system (figure 1).
[Figure: block diagram — saline feed water → pretreatment → high-pressure pump → membrane assembly (concentrate discharge) → post-treatment → fresh water]
Figure 1. A schematic presentation of the reverse osmosis plant
The plant under study receives feed from either of two beachwells located a short distance from it. Each
beachwell is fitted with a submerged pump; each of the pumps has a capacity of pumping seawater at a rate of
72 m3/h. Before feeding seawater to the RO systems, it is pretreated to eliminate any coarse pollution matter
and biofoulants. Additional treatments, including chlorination, filtration, antiscalant addition and dechlorination,
are provided to ensure the long service of the RO module system. The design temperature is 25°C (22°C
minimum and 35°C maximum). Fine filtration is done through 5-µm cartridge filters in the final filtration stage.
A high-pressure pump then pressurizes this high-purity filtrate to the required pressure level (60 to 65 bar) for
the desalination process. Two types of RO membranes (one each for a train) are used, spiral-wound and
hollow-fiber, each with a capacity of 300 m3/d. The designed recovery rate has been fixed at 35%. The product
water (i.e., high-purity water) is taken out at the end of the trains and sent for post-treatment (addition of
minerals to make the water potable). The concentrated brine coming out of the system has a high pressure
level (up to 50 to 60 bar). It is allowed to pass through the energy recovery system (a Pelton wheel turbine).
The RO process is permanently monitored and volumetrically controlled to comply with the predetermined
parameters. The membranes are cleaned at intervals depending on the actual service conditions; the
cleaning/flushing equipment is made of suitable materials and comprised of a solution tank with motorized
agitation, a pump, and cooling and heating facilities equipped with all necessary instruments such as
temperature and pressure gauges and flow meters. The membranes are preserved with formaldehyde solution, if
the shutdown period is longer than 4 or 5 days. Before the units are restarted, the formaldehyde solution is
removed from the system and collected in the cleaning solution tank.
The pH of the product water is maintained at 7.5 to 8.2 by the addition of bicarbonate ions (using a
dolomite/limestone dissolution filter). The equipment, materials, and instruments (except the membranes) have
a working life of 20 years when working continuously (90% availability) or intermittently at variable outputs.
The layout and control of the plant and equipment ensure easy operation with minimum manpower
requirements. The central control panel for the plant enables the operator to start and shut down the plant
partially or completely. For minimum breakdown, the design specifications for all the subsystems of the plant
were reviewed. Since the plant was a new plant and several other research objectives were attached to the
operation parameters and performance of the plant, only the parameters relevant to the project were reviewed.
In the RO process, the quality of feed is extremely important for the life of the RO membranes. The feed
should be free from all suspended particles, and this is ensured by the filtration process. In the existing plant, the
standby filters remove any possibility for failure due to malfunctioning of the online filter in the system. Hence
for any future RO plant, it is recommended that standby filters be installed. The quality of seawater feed also
determine the need for acid dosing, NaHSO4 dosing and addition of antiscalant. Normally the failure of a system
rarely occurs due to failures of the pumps used for these pretreatments of the feed unless human error intervenes
in the dosing activity. The availability of the RO plant, therefore, greatly depends on the performance of the
high-pressure pumps, membranes and their housings, various seals at all the junctions, and the dial
gauges/indicators recording the various parameters. The thickness and material for the membrane housing
should be chosen based on the maximum pressure obtained from the high-pressure pump and the desired safety
factor level. Before housing the membranes, any crack or non-uniformity of thickness of the shell should be
checked. The high-pressure pump should have the highest reliability; hence, selection of the pump should be
made very carefully. There should be a proper preventive maintenance plan for the high-pressure pump to
reduce the probability of its failure during operation. The membranes should be cleaned following the procedure
recommended by the supplier. If the membranes are cleaned from time to time, both the quantity of product
water and the membrane life improve. To obtain higher availability of the plant, all the important
parameters like quality of feed, amount of flow, pH, feed concentration, feed temperature, SDI, high-pressure
pump outlet pressure, RO feed pressure, and brine pressure, should be monitored at regular intervals. Should
any parameter drift from the desired value, the plant operation should be stopped; otherwise damage may occur
to the materials, equipment and workers in the plant, and poor-quality product water may be produced.
2. Research objectives
The main aim of the work is to estimate the availability of the RO plant. The specific objectives are to:
• carry out a detailed survey of operating conditions and performance of materials, components, and
subsystems of the RO plant;
• identify causes and sequences of failures in the RO plant; and,
• assess the failure rate of the RO plant by identifying the components and events causing downtime and
unwanted effects.
3. Failure analysis
From the data recorded at the plant and from the design specifications of all the components and subsystems, it
is felt that the failure of the RO plant may arise from failures of the beachwell pumps and the seawater feed
line, the pretreatment process (chlorination, dechlorination, antiscalant addition, and pH control), cartridge
filters, high-pressure pumps and motors, membranes, the cleaning/flushing system, the energy recovery system,
the post-treatment system for the product water, valves (including leakage), instruments and controls, and
various pipe lines. Failure could also occur from loss of the power supply and human error; however, these are
not considered since they are independent of the RO technology.
Several safety analysis methods are used to assess the failure of a system. Failure Mode Effects and
Criticality Analysis (FMECA) is considered an important step in the risk and safety analysis of a system. It
involves reviewing as many components, assemblies and subsystems as possible to identify failure modes,
critical failures and their causes and effects (Billinton & Allan (1992), Henley & Kumamoto (1992)). It is
mainly a qualitative analysis and is a very useful tool in suggesting design improvements to meet reliability
requirements. In this study, it was decided to adopt a top-down approach, and the analysis was extended down
to a level at which failure rate estimates were available from the data already collected.
Since the RO plant was new, the construction of the FMECA was hampered by the lack of sufficient data
on failures, their causes and their effects. Statistically, the estimation of the failure times of all the components
and subsystems of the plant was very difficult in the absence of sufficient objective data. Hence, in a few cases,
subjective assessments of failure times and of the causes and effects of failures were made from the experience
of the plant personnel. The data on all the special incidents noticed at the RO plant were recorded in the log
book and analyzed for the FMECA of the RO plant in the following fashion:
• The RO plant was divided into its subsystems: beachwell, pretreatment of the feed, filtration, high-pressure
pump, membrane system, energy recovery, and post-treatment of the product water.
• The system’s functional diagrams and drawings were reviewed to determine the interrelationships between
the various subsystems.
• The operational conditions, which might affect the system’s performance, were assessed and reviewed to
determine the adverse effects that they could generate on the system.
• For each subsystem (and, if possible, for components), the operational and failure modes were identified and
recorded. In addition, possible failure mechanisms, which might produce the identified failure modes, were
also recorded.
• The mechanisms to detect a failure mode were then studied.
• From the available data, a failure rate assessment was made for each of the failure modes. (Unfortunately the
data was not sufficient since only two-and-a-half years’ data were available).
• The failure effects were ranked based on their importance, and critical failures were identified.
Event tree analysis (ETA) is an inductive methodology. It starts with a specific initiating event and
follows all progressions of the accident and its contribution to the failure of other components and subsystems.
The probability of failure of a component/subsystem is calculated by tracing back and identifying the
possibility of all the accidents that led to it.
Fault Tree Analysis (FTA) is another method used in safety analysis. It is a deductive methodology for
determining the potential causes of accidents, or of system failures more generally, and for estimating the
failure probabilities. FTA is centred on determining the causes of an undesired event, referred to as the top
event, since fault trees are drawn with it at the top of the tree. It then proceeds downward, dissecting the system
in increasing detail to determine the root causes or combinations of causes of the top event. Top events are
usually failures of major consequence, engendering serious safety hazards or the potential for significant
economic loss.
FTA yields both qualitative and quantitative information about the system under study. Fault tree
construction provides the analysts with a better understanding of the potential sources of failures, which leads
to rethinking the design and operation of the system in order to eliminate many potential hazards. Once
completed, the fault tree can be analyzed to determine the combinations of component failures, operational
errors, or other faults that initiate the top event. Finally, the fault tree may be used to calculate the demand
failure probability, unreliability, or unavailability of the system under study. A fault tree is a diagram that
displays the logical interrelationships between the basic causes of the failure. A few standard symbols,
commonly known as gates, are used to depict the relationships between the events giving rise to the failure of
the system. "AND gates" are used to connect groups of events and conditions when all of them must be present
simultaneously for the hazardous event to occur, whereas "OR gates" represent the existence of alternative
ways in which a failure can occur.
4. Methodology
The aim of reliability and availability management for any continuously operated industrial system is to produce
the desired level of output on a continuous basis without failures, and to restore the system to an operable state
as early as possible whenever it fails, Haimes et al. (1992). Management can achieve the lowest possible total
costs if reliability and availability are maintained at a high level. Reliability management
of a system is a systematic approach to identifying and assessing the causes and frequencies of its failures, and
reducing and/or controlling the effects of failures to provide the satisfactory performance of the system to the
society, Bazovsky (1961). Component failures and human errors not only affect the performance of a system
but can also cause accidents. The frequencies of such events are assessed during the design stage of the
system. To derive the maximum benefit, reliability analysis of such a system has to be carried out at the
design stage, and it should be continued until the system is finally replaced. A fresh analysis may be
recommended whenever the system is modified. Since continuously operated systems can tolerate failures,
they can be restored to an operational level by carrying out the required repairs and maintenance. For such
continuous systems, a more appropriate performance measure is availability, which is defined as the probability
that a system or component is performing its required function at a given point in time, or over a stated period
of time, when operated and maintained in a prescribed manner, Ebeling (1997). It is
classified under point-wise availability (i.e., availability at specific points in time), interval availability (i.e.,
availability for an interval of time) and inherent availability (i.e., long-run availability). For continuously
operated systems, inherent availability is the most meaningful; it is a ratio of the total uptime (i.e. total operating
time) to the total system time (i.e. sum of uptime and downtime). Availability takes into account not only the
failure aspect of the system (reliability), but also the restoration of the failed components through repair or
replacement (maintainability). Maintainability is a design feature, and appropriate considerations are required
regarding this aspect for any continuously operated system. Since the total downtime is composed of the time
for inspection and detection of faults, the time to repair faults, and administrative time, one can aim at
minimizing each of these components to reduce the total downtime. Mathematical models have been designed to
estimate the downtimes of systems, and researchers have approximated downtime distributions by the negative
exponential distribution. If M(t) represents the maintainability function and µ the repair rate (so that the mean
downtime is 1/µ), then M(t) is given by

M(t) = 1 − e^{−µt}   (1)
The time-to-failure distribution characterises the reliability of the system; most commonly the exponential
distribution, with failure rate λ, is used, giving the reliability function

R(t) = e^{−λt}   (2)
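To make these measures concrete, a small sketch follows (ours, not part of the original study; the rates and times are hypothetical). It evaluates equations (1) and (2) together with the inherent availability ratio described above:

```python
import math

def maintainability(t, mu):
    """Equation (1): M(t), the probability that a repair is completed
    within time t, for exponential downtime with repair rate mu."""
    return 1.0 - math.exp(-mu * t)

def reliability(t, lam):
    """Equation (2): R(t), the probability of surviving to time t,
    for exponential time to failure with failure rate lam."""
    return math.exp(-lam * t)

def inherent_availability(uptime, downtime):
    """Long-run (inherent) availability: total uptime over total system time."""
    return uptime / (uptime + downtime)

# Hypothetical figures: rates per hour, times in hours
print(maintainability(4.0, mu=0.5))         # P(repair done within 4 h)
print(reliability(100.0, lam=1e-3))         # P(no failure in 100 h)
print(inherent_availability(8000.0, 64.0))  # uptime ratio over a year, say
```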
A detailed FT diagram was drawn for the plant using OR and AND gates, as shown in figure 4, Chaudhuri &
Hajeeh (1999). The outputs of these gates, in terms of the event unavailability, were computed. Unavailability of
the plant due to power supply disruption and human error was not included in the overall estimate of
unavailability, since these are independent of the RO technology. Consider the AND fault tree given in figure
2, where the simultaneous existence of the events B1, …, Bn results in the top event. The system unavailability
Q_s(t) is the probability that all events exist at time t, and is given by equation 3.
[Figure 2. Gated AND fault tree: the top event fed by an AND gate over the basic events B1, B2, …, Bn]

Q_s(t) = ∏_{i=1}^{n} Q_i = Pr(B_1 ∩ B_2 ∩ … ∩ B_n) = Pr(B_1) Pr(B_2) ⋯ Pr(B_n)   (3)
For an OR fault tree, as given in figure 3, the top event exists at time t if and only if at least one of the n
basic events exists at time t. Therefore, the system unavailability is given by equation 4.

[Figure 3. Gated OR fault tree: the top event fed by an OR gate over the basic events B1, B2, …, Bn]

Q_s(t) = 1 − ∏_{i=1}^{n} (1 − Q_i) = Pr(B_1 ∪ B_2 ∪ … ∪ B_n) = 1 − [1 − Pr(B_1)][1 − Pr(B_2)] ⋯ [1 − Pr(B_n)]   (4)
[Figure 4. FT diagram of the plant using OR and AND gates]
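As a minimal sketch of how the two gate formulas are evaluated (our illustration, assuming independent basic events and made-up unavailability values):

```python
def and_gate(q_list):
    """Equation (3): all basic events must exist simultaneously,
    so the top-event unavailability is the product of the Q_i."""
    q = 1.0
    for qi in q_list:
        q *= qi
    return q

def or_gate(q_list):
    """Equation (4): the top event exists if at least one basic event
    exists, so Q_s = 1 - prod(1 - Q_i)."""
    q = 1.0
    for qi in q_list:
        q *= (1.0 - qi)
    return 1.0 - q

# Made-up basic-event unavailabilities for illustration
print(and_gate([0.03, 0.03]))   # redundant pair: 9.0e-4
print(or_gate([0.01, 0.005]))   # series elements: ~0.01495
```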
5. Results and Discussion
Since the plant under study is new, small, and efficient, the unavailability (downtime) is used as a performance
indicator. The unavailability is calculated as

Q = 1 − A = Downtime / (Uptime + Downtime)   (5)
Since the plant was new, and statistical assessment of failure and repair rates was difficult due to the lack of
data, system availability was computed as a ratio of the time the system was working satisfactorily and the total
system time.
No  Sub-system                                            Unavailability (Q)
1   Beachwell pumps (bwp): 1 operated, 1 standby          0.1262 × 10^-3
2   Main filters: 1 operated, 2 standby                   0.00013
3   Dosing:
      Antiscalant: 1 operated, 1 standby                  0.00000
      Acid: 1 operated, 1 standby                         0.2314 × 10^-4
      NaHSO3: 1 operated, 1 standby                       0.1134 × 10^-2
4   High-pressure pumps (HPP):
      Train 1                                             0.03393
      Train 2                                             0.02985
5   Energy recovery turbines (ERT):
      Train 1                                             0.03065
      Train 2                                             0.03107
6   Reverse osmosis system:
      - Without ERT                                       0.00118
      - With ERT                                          0.00381
    Plant overall unavailability (Qplant)*:
      - Without ERT                                       0.002318 (0.9 days/year)
      - With ERT                                          0.005113 (1.87 days/year)

* Qplant = 1 − (1 − Qbwp)(1 − Qpre-treatment)(1 − QRO system)

Table 1. Unavailability of the different sub-systems of the RO plant
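The footnote formula of Table 1 can be checked numerically. The sketch below is ours; the pre-treatment block is collapsed to a single aggregate value assumed purely for illustration:

```python
def plant_unavailability(q_bwp, q_pretreatment, q_ro):
    """Footnote to Table 1: the beachwell, pre-treatment and RO blocks act
    in series, so Qplant = 1 - (1 - Qbwp)(1 - Qpre)(1 - QRO)."""
    return 1.0 - (1.0 - q_bwp) * (1.0 - q_pretreatment) * (1.0 - q_ro)

# Beachwell and RO values from Table 1; the pre-treatment aggregate is assumed
q = plant_unavailability(q_bwp=0.1262e-3, q_pretreatment=1.1e-3, q_ro=0.00381)
print(q, q * 365.0, "days/year")  # same order as the tabulated 1.87 days/year
```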
6. Conclusion
The reverse osmosis (RO) process is one of the major processes for producing potable water from seawater
through desalination. The performance of any RO plant depends on the failure behavior of its subsystems. Since
an RO plant is to be operated continuously with a minimum amount of downtime, the reliability of its subsystems
must be maintained at a high level by proper design and selection of materials. Standby redundancy is needed
for those subsystems which are critical in nature and whose failures would cause the entire plant to stop. Since
RO plants are likely to show very high availability, operating a parallel combination of a few RO plants can
become a viable alternative to the other desalination plants used in Middle Eastern countries. The design and
installation of RO plants for desalination of seawater are recommended for the region because of their high
performance and economy. Future research should attempt to compare the performance of RO technology with
other water desalination technologies, such as multi-stage flash (MSF), which is extensively used in the region.
References
Bazovsky, I. (1961) Reliability Theory and Practice, Prentice Hall, Inc., Englewood Cliffs, New Jersey
Billinton, R. and Allan, R. N. (1992) Reliability Evaluation of Engineering Systems: Concepts and Techniques, Plenum Press, New York
Chaudhuri, D. and Hajeeh, M. (1999) Reliability, availability and risk assessment for Reverse Osmosis, Technical Report, Kuwait Institute for
Scientific Research, Kuwait
Ebeling, C. E. (1997) Introduction to Reliability and Maintainability Engineering, McGraw-Hill Companies, Inc., New York
Ebrahim, S. and Abdel-Jawad, M. (1994) Economics of seawater desalination by reverse osmosis, Desalination, 99 (11), 39-55
Haimes, Y. Y., Moser, D. A. and Stakhiv, E. Z. (edited) (1992) Risk-Based Decision Making in Water Resources, American Society of Civil
Engineers, New York
Henley, E. J. and Kumamoto, H. (1992) Probabilistic Risk Assessment, IEEE Press, New York
Parekh, B. S. (edited) (1988) Reverse Osmosis Technology, Marcel Dekker Inc., New York
The use of IT within maintenance management for continuous improvement
Anders Ingwald, Mirka Kans
School of Technology and Design, Department of Terotechnology, Växjö University, S-351 95 Luckligs plats 1,
Sweden
[email protected], [email protected]
Abstract: For a long time maintenance has been treated as a separate working area, isolated from other areas
such as production and quality. However, awareness of the importance and complex nature of maintenance has
increased. To take full advantage of maintenance, systems that assist in the tasks of planning and follow-up on
a continuous improvement basis are required. This paper describes maintenance practices and how IT is used
for maintenance management, using data from a survey on maintenance management in Swedish industry. The
paper finds that maintenance activities connected to the Plan and Check phases of the PDCA-cycle are
emphasised to a low extent while the Do phase is more emphasised, and that companies tend to intentionally
select CMMS functionality that supports the Plan phase of the PDCA-cycle if they put high emphasis on these
kinds of activities.
1. Introduction
Traditionally, maintenance has been planned and executed separately from dependent working areas such as
production and quality. New manufacturing philosophies and complex production equipment have changed the
situation. The consequences of problems and disturbances in the production process are today diverse and
severe, leading to higher demands on reliable production, see e.g. Luce (1999), Vineyard et al. (2000) and
Holmberg (2001). It has been shown that company-wide integration of maintenance and long-term maintenance
plans are important for the success of companies, Jonsson (1999). Furthermore, Mitchell et al. (2002) showed
in a case study performed in England that successful companies have better maintenance practices. They also
conclude that good maintenance practice can have an impact on broader practices and strategies and generate
positive synergy effects. The changed role of maintenance has led to new demands on information technology
(IT) systems for the planning, execution and follow-up of maintenance activities. Traditional computerised
maintenance management systems (CMMS) support the execution of maintenance but give little support for the
planning and follow-up phases, see e.g. Liptrot & Palarchio (2000). Pintelon & Van Puyvelde (1997) point out
the importance of a well-functioning computerised maintenance reporting system, and also the fact that most
systems in this area are limited to budget reporting only.
The level of computerisation and IT maturity varies between different activities within a company. The
level of computerisation in maintenance management, for instance, could still be considered low, and
maintenance management information technology (MMIT) has in general not been in focus when developing
the enterprise IT strategy. Jonsson (1997) reports that 64% of the 284 Swedish manufacturing firms included in
that study used manual information systems. We can still find maintenance departments relying on manual
systems, or where MMIT is combined with paper documents and only a limited history is kept. The use of IT is
connected to the main focus or main goals of the maintenance organisation. This focus can change over time,
generally from efficiency goals towards higher goals of effectiveness and cost-effectiveness, but the opposite is
also possible. If the change of focus is connected to the concept of continuous improvement, see for instance
Deming (1986), the direction of the change would be towards higher maintenance goals. This paper further
explores the use of IT within maintenance in the Swedish manufacturing industry, especially trying to answer
whether continuous improvement is supported by the use of IT tools.
2. Study of IT use within maintenance
To fulfil the aim of this paper, the results from a survey conducted in Swedish industry during 2003 were
used. In the following, the survey participants, the choice and processing of the variables, and the connection
between the variables and the PDCA-cycle are presented.
2.1 Survey participants
For this cross-sectional survey, production plants with more than 100 employees were selected using
information from Statistics Sweden (Statistiska Centralbyrån). The population was selected based on the
Swedish Standard Industrial Classification (SE-SIC) 2002. The following industries were selected for the survey:
mining and quarrying except energy-producing materials; manufacturing; electricity, gas and water supply; and
transport, storage and communication. 1440 questionnaires were sent out. Because of the low number of
responses from companies in some industries, we limited the survey to industries with a high response rate. The
total number of questionnaires in this restricted group was 539 and the number of respondents 118, giving a
response rate of about 22%, see Alsyouf (2004). The respondents in this restricted population were distributed
according to figure 1.
[Figure 1. Distribution of respondents: percentage of respondents per industry sector (mechanical engineering; wood and timber; pharmaceutical and chemical; steel and metal work; pulp and paper; media and printing; petrochemical)]
2.2 Determining and describing the variables of the study
From the survey material two questions were used as input for this study: M1 (how much emphasis is placed
on different activities) and IT4 (to what extent different CMMS modules or functions are used), see Appendix 1.
For question M1 an ordinal scale from 0 to 5 was used, where 1 denoted Not important and 5 Very important.
For question IT4 an ordinal scale from 0 to 5 was also used, where 1 denoted Used minimally and 5 Used
extensively. For both questions, alternative 0 denoted Do not have. From M1, 15 of a total of 26 variables were
used; the remaining 11 variables were not relevant for our purpose. We then mapped the selected M1-variables
against the IT4-variables, i.e. determined which IT functionalities are required to perform a certain maintenance
activity. As a result, three of the twelve IT4-variables were excluded as irrelevant for our case. The results of
the mapping after the reduction are shown in Table 1.
In the following, a description of the connection between the remaining variables is given, starting from
maintenance activity number one, where on-line monitoring of critical machinery is directly connected to the
functionality of analysing condition-monitoring parameters. Further, equipment failure data capture and
storage, which is the basis for equipment failure diagnosis, is needed for recording the period and frequency
of failures. Recording the period and frequency of short stoppages, in addition to the ability to capture and
store equipment failure data, is connected to the use of maintenance key performance measures, as short
stoppages are one of the parameters required for calculating Overall Equipment Effectiveness (OEE). The poor
quality rate is also a parameter of OEE and therefore connected to the maintenance key performance measures.
The activity of using failure historical data requires information about the equipment in the form of the parts
list and the repair history. Using information from different parts of the company of course requires access to
data from production, finance, quality assurance, purchasing etc., but in this study only the maintenance-related
parts of data capturing, processing and storage were included. The CMMS functionalities included in IT4 that
support the use of company-wide information are equipment repair history, equipment failure data and
maintenance key performance measures (which could include data from locations other than the maintenance
organisation). For analysing equipment failure causes and effects, data from the equipment repair history,
equipment failure data and condition-monitoring data could be utilised.
Experience from part repairs and the ability to find the failed component are important for the activity of
restoring equipment to operation. This information is found in the equipment parts list and the repair history.
For performing maintenance according to original equipment manufacturer (OEM) recommendations, planning
functionality is needed for work orders, preventive maintenance activities and spare parts requirements. The
equipment parts list is needed to allocate the maintenance activities. The same functionalities are needed for
preventive maintenance based on statistical modelling of failure data; in addition, failure data are needed for the
modelling part. Activities performed based on condition monitoring require data from the repair history to
adjust warning levels, and the ability to analyse condition-monitoring parameters. The activity need not be
scheduled in advance but, apart from that, the same planning functionality as for preventive maintenance is
required. For decreasing the repair time, past repair experience is an important input when optimising the work
order scheduling. Inventory control and spare parts requirement planning are required when working to keep
the spare parts inventory at a minimum. Historical repair data from current equipment are used when
evaluating and selecting an OEM, while data about failures, repairs and other available maintenance performance
measures such as OEE could be utilised for the improvement of production, for instance by finding problem
machines and reducing bottlenecks through more efficient maintenance.
Table 1. Variable mapping after reduction
2.3 Connecting the maintenance activities with continuous improvement
The 15 remaining variables of M1, describing where the emphasis of maintenance is placed, were categorised
into four groups: Information gathering, Information analysis, Maintenance execution and Improvement
activities, as shown in Table 2. Furthermore, the relationships between activities and the PDCA-cycle were
determined and included in Table 2.
The first two groups, covering variables 1-7, are activities connected to information needs and are input to
the first phase of the PDCA-cycle, Plan. Groups three and four, covering variables 8-15, are connected to the
Do phase of the PDCA-cycle, where improvement activities are carried out. In the Check phase, information
coverage is once more important, and the activities described in variables 1-7 are used again. The Act phase
aims at making the improvement activity tested in the Do phase permanent and will affect future maintenance
activities, for instance those described in variables 8-11, and the improvement activities, such as those described
in variables 12-15. The mapping between maintenance activities and continuous improvement shown in Table 2
is only an example valid for this dataset; we do not in any way claim that these are the maintenance activities
that are required for continuous improvement.
Plan →
  Information gathering:
    1. On-line monitoring of critical machinery
    2. Recording of the period and frequency of failures
    3. Recording of the period and frequency of short stoppages
    4. Recording the poor quality rate
  Information analysis:
    5. Use of failure historical data
    6. Use of company wide information for diagnosis
    7. Analysing equipment failure causes and effects
Do →
  Maintenance execution:
    8. Restoring equipment to operation
    9. Performing the maintenance tasks according to OEM
    10. Performing the maintenance tasks based on statistical modelling of failure data
    11. Performing the maintenance tasks based on condition monitoring
  Improvement activities:
    12. Decreasing the repair time
    13. Keeping the level low in spare parts inventory
    14. Helping the purchase department in OEM selection
    15. Helping to improve the production process
Check →
  Information gathering: activities 1 to 4
  Information analysis: activities 5 to 7
Act:
  Will affect maintenance execution and improvement activities

Table 2. Maintenance activities related to improvement according to the PDCA-cycle
3. Analysis
Respondent answers for each of the fifteen maintenance activities were divided into two groups according to
importance: Very high importance (5) and High importance (4) in one group, referred to as "High", and another
group covering the other answers (3, 2, 1 or 0), referred to as "Low". The percentage distribution of these groups
is shown in Figure 2. As can be seen, much emphasis is put on variables 7, 8, 9, 12 and 15, i.e. analysing
equipment failure causes and effects, restoring equipment to operation, performing the maintenance tasks
according to OEM, decreasing the repair time and helping improve the production process, while low emphasis
is put on variables 6, 10 and 13, i.e. use of company-wide information for diagnosis, performing the maintenance
tasks based on statistical modelling of failure data and keeping the level low in spare parts inventory.
[Figure 2. Rating of the importance of maintenance activities: percentage of respondents with high emphasis (5 and 4) versus low emphasis (3 to 0) for each of variables 1-15]
The variables of the matrix presented in Table 1 were analysed by comparing the maintenance activities
seen as important with the extent to which CMMS functionality is used to support those activities, see Table
3. For example, of the companies that put high emphasis on activity 6, use of company-wide information for
diagnosis, 38% make high use of CMMS functionality e, equipment failure diagnosis, while of the companies
that put low emphasis on the same activity, only 2% make high use of that functionality.
Table 3. CMMS-functionality used for emphasised maintenance practices
A set of functions forms the core of a CMMS, see for instance Kans (2005): work order handling,
preventive maintenance scheduling, spare parts inventory, plant register, cost and budgeting, and maintenance
history. To be able to distinguish whether CMMS functionality is intentionally selected by the company, the
authors defined a significant difference in the use of a functionality as an increase of at least 50% between
those companies with low emphasis on the activity and those with high emphasis. As expected, some CMMS
functionality shows high use regardless of whether the maintenance activity is emphasised or not: work order
planning and scheduling, preventive maintenance planning and scheduling, equipment parts list, equipment
repair history, inventory control and spare parts requirement planning. These functionalities all belong to the
core of a CMMS and are almost always available whether or not the user asks for them. Significant differences
could be found for the following functions: equipment failure diagnosis, analysis of condition-monitoring
parameters and maintenance key performance measures.
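A minimal sketch of this screening rule as we read it (ours; the shares are fractions of companies reporting high use of a given CMMS function):

```python
def intentionally_selected(share_high, share_low):
    """Screening rule from the text: the functionality counts as intentionally
    selected if its use among companies emphasising the activity is at least
    50% higher than among companies that do not emphasise it."""
    return share_high >= 1.5 * share_low

# Worked example from the text: activity 6 vs functionality e (38% vs 2%)
print(intentionally_selected(0.38, 0.02))   # True
```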
4. Results and conclusion
From Figure 2 we can see that information gathering activities in general receive the lowest attention. A similar
situation can be noted for information analysis: only the activity of analysing equipment failure causes and
effects is highly emphasised in over half of the cases. The main emphasis is put on the execution of maintenance,
reflected in activities such as restoring equipment to operation and performing planned maintenance. Still,
companies are interested in improvement: two out of four improvement activities scored over 50% and one was
close to 50% (49%). Thus, the Plan and Check phases of the PDCA-cycle are emphasised to a low extent while
the Do phase is more emphasised. This is remarkable, as gathering information and analysing the current
situation is the basis for further improvement work. Our interpretation is that maintenance organisations are
aware of the relationship between maintenance and other closely related working areas, but that resources, e.g.
in the form of knowledge, time or appropriate tools, are lacking for the implementation of the continuous
improvement concept within the organisation.
A slightly different view is given when comparing maintenance activities with CMMS use. While much of
the CMMS functionality seems to be used simply because it belongs to the core of a CMMS, some CMMS
functionality appears to be selected intentionally based on the company's maintenance needs. These CMMS
functionalities are mostly related to the planning and execution of maintenance and less to improvement
activities. The exceptions are variable 15, helping improve the production process, and to some extent variable
7, analysing equipment failure causes and effects, which can be referred to improvement activities. In other
words, companies tend to intentionally select CMMS functionality that supports the Plan phase of the
PDCA-cycle if they put high emphasis on the activities that were defined in this study as belonging especially
to information gathering but also to analysis.
According to the study, most CMMS functionality that directly supports improvement activities, see Table 2,
is highly used. This is not to say that it is actually used for improvement, since most of it is used to almost the
same extent regardless of whether the improvement activities are emphasised or not. Furthermore, the CMMS
functionality that supports data gathering is not used to a great extent. Thus, the entire improvement cycle is in
general not supported by the use of IT tools in Swedish industry.
References
Deming, W. E. (1986) Out of the Crisis, Cambridge University Press, Cambridge, Massachusetts
Holmberg, K. (2001) Competitive Reliability 1996-2000, Technology Programme Report 5/2001, Final Report Edited By Kenneth
Holmberg, National Technology Agency, Helsinki
Jonsson, P. (1997) The status of maintenance management in Swedish manufacturing firms, J. of Quality in Maintenance Engineering, 3
(4), 233-258
Jonsson, P. (1999) Company-wide integration of Strategic Maintenance: an empirical analysis, International J. of Production Economics,
60-61, 155-164
Kans, M. (2005) On the identification and utilisation of relevant data for applying cost-effective maintenance, licentiate thesis, Växjö
University, School of Technology and Design
Liptrot, D. and Palarchio, E. (2000) Utilizing advanced maintenance practices and information technology to achieve maximum
equipment reliability, International J. of Quality & Reliability Management, 17 (8), 919-928
Luce, S. (1999) Choice criteria in conditional preventive maintenance: short paper, Mechanical Systems and Signal Processing, 13 (1),
163-168
Mitchell, E., Robson, A. and Prabhu, V. B. (2002) The impact of maintenance practice on operational and business performance,
Managerial Auditing J., 15 (5), 234-240
Pintelon, L. and Van Puyvelde, F. (1997) Maintenance performance reporting systems: some experiences, J. of Quality in Maintenance
Engineering, 3 (1), 4-15
Vineyard, M., Amoako-Gyampah, K. and Meredith, J. (2000) An evaluation of maintenance policies for flexible manufacturing systems:
a case study, International J. of Operations and Production Management, 20 (4), 409-426
Appendix 1. Questions used from the survey

M1. How much emphasis is placed on each of the following activities?
(Scale for each item: 0 = Do not have; 1 = Not important ... 5 = Very important.)
a. Restoring equipment to operation (acute)
b. Installing new equipment
c. Keeping the level low in spare parts inventory
d. Having inventory between machines, Work in Process/Progress (WIP)
e. Decreasing the repair time
f. Investing in improving the skills and competence of maintenance staff
g. Use of computerized maintenance management systems (CMMS)
h. Analysing equipment failure causes and effects
i. Using failure historical data
j. Off-line monitoring of critical machinery (production is stopped during test)
k. On-line monitoring of critical machinery (test is done during production)
l. Performing the maintenance tasks according to the original equipment manufacturer (OEM) recommendations
m. Performing the maintenance tasks based on condition monitoring
n. Performing the maintenance tasks based on statistical modelling of failure data
o. Helping the purchasing department in OEM selection
p. Performing periodic planned replacement
q. Automatic diagnosis (expert system)
r. Remote diagnosis (measurements are sent to another place for analysis)
s. Use of company wide information for diagnosis
t. Cross functional groups (for instance improvement groups)
u. Helping improve the production process
v. Helping design the production process
w. Recording the period and frequency of failures
x. Recording the period and frequency of short stoppages
y. Recording the poor quality rate
z. Annual overhaul

IT4. To what extent is each of the following computerised maintenance system modules or functions used?
(Scale for each item: 0 = Do not have; 1 = Used minimally ... 5 = Used extensively. A separate box was provided for "Do not have a CMMS".)
a. Work-order planning and scheduling
b. Preventive maintenance planning and scheduling
c. Analysis of condition monitoring parameters
d. Equipment failure diagnosis
e. Equipment repair history
f. Equipment parts list
g. Manpower planning and scheduling
h. Inventory control
i. Spare part requirement planning
j. Material and spare parts purchasing
k. Maintenance budgeting
l. Maintenance key performance measures
An approximate algorithm for condition-based maintenance applications
Matthew J. Carr, Wenbin Wang
Centre for Operational Research and Applied Statistics, Salford Business School, University of Salford, UK.
[email protected], [email protected]
Abstract: Established condition-based maintenance techniques can be computationally expensive. In this paper
we propose an approximate methodology using extended Kalman filtering and condition monitoring
information to recursively establish a conditional density for the residual life of a component. The conditional
density is then used to construct a maintenance/replacement model. The advantage of the methodology, when
compared with alternative approaches, is the computational efficiency that potentially enables the simultaneous
condition monitoring and associated inference for a large number of components. The application of the
methodology is described for a vibration monitoring scenario and demonstrated using actual case data.
Keywords: Condition-based maintenance, condition monitoring, extended Kalman filtering, residual life
1. Introduction
The Kalman filter is a well known technique in the context of state estimation for discrete time stochastic
systems. Efficient updating and prediction equations are easily established to obtain the parameters of the
conditional distribution for some underlying state using stochastically correlated indicatory information. As
such, it can potentially be a useful technique for condition monitoring (CM) applications involving the
simultaneous monitoring of a large number of components when limited computational resources are available.
This is particularly important for on-line diagnosis and prognosis when rapidly processing a large amount of
data is required. Existing techniques such as proportional hazards modelling (Makis & Jardine (1991), Banjevic
& Jardine (2004)) and non-linear probabilistic stochastic filtering (Wang & Christer (2000), Wang (2002))
involve numerical integration routines and can be computationally expensive to apply simultaneously to a large
number of individually monitored components.
The standard Kalman-filter can be derived within the framework of a general non-linear filter when the
system and observation dynamics evolve linearly and the model errors are assumed to be independent and
follow 0-mean Gaussian white noise processes, see Jazwinski (1970). However, in reality, these assumptions
rarely hold. There are a number of varieties of the extended Kalman filter (EKF) available in the literature on
stochastic state estimation techniques. EKFs are designed to enable the application of variations on the
standard Kalman filtering methodology to linearised versions of non-linear systems. The linearisation is
achieved using Taylor expansions of the state and observation equations; the order of the filter depends on
the number of terms of the Taylor expansions that are included in the linearised equations.
In this paper, we introduce a semi-deterministic form of the EKF in which the state of the system is unknown
but evolves deterministically. The approach is then adapted specifically for applications involving vibration
information, where the underlying state that we are attempting to predict is defined as the residual life of an
individual component. The deterministic element of the EKF process is designed to preserve the exact
relationship between realisations of the actual underlying residual life at sequential CM points throughout the
component's lifetime. We then illustrate the application of the vibration model using a case-based example.
When constructing EKF algorithms for conditional residual life prediction, the same principles apply when
using different types of CM information, the only difference being the specification of the relationship between
the monitored information and the underlying residual life.
2. A semi-deterministic extended Kalman filter
For a discrete time process, the evolution of a general state vector is described using the non-linear function

x_{i+1} = f(x_i) + η_i   (1)

where x_i is the realisation of the state vector at the ith discrete time point, at time t_i. For a deterministic
relationship, η_i is a 0-mean process with covariance matrix 0 and is henceforth removed from consideration. At
the ith discrete time point, we describe the relationship between an observed information vector y_i and the
underlying state using the non-linear function

y_i = h(x_i) + e_i   (2)

where the measurement errors are normally distributed as e_i ~ N(0, R_i).
The first step in applying the extended Kalman filtering methodology is to linearise the functions f and h in
equations (1) and (2). At the ith discrete time point, we define x̂_{i|i} = E[x_i | Y_i] as an estimate of x_i that is
conditioned on the observation history available up to that point, Y_i = {y_1, y_2, …, y_i}. We also define
x̂_{i+1|i} = E[x_{i+1} | Y_i] as the one-step prediction of x_{i+1} that is again conditioned on Y_i. The non-linear
functions f and h are linearised as

f(x_i) ≈ f(x̂_{i|i}) + f′(x̂_{i|i})(x_i − x̂_{i|i})   (3)

h(x_i) ≈ h(x̂_{i|i−1}) + h′(x̂_{i|i−1})(x_i − x̂_{i|i−1})   (4)

and using these approximations, the state transition expression of equation (1) is written as

x_{i+1} = f′(x̂_{i|i}) x_i + u_i   (5)

where u_i = f(x̂_{i|i}) − f′(x̂_{i|i}) x̂_{i|i}. Similarly, the observation expression (equation (2)) becomes

y_i = h′(x̂_{i|i−1}) x_i + e_i + w_i   (6)

where w_i = h(x̂_{i|i−1}) − h′(x̂_{i|i−1}) x̂_{i|i−1}.
Assuming that the initial values for the underlying state, x_0, and the associated covariance matrix, P_0, are
known, the Kalman filtering methodology can be applied directly to the linearised system given by
equations (5) and (6). The extended Kalman filter is a recursive algorithm incorporating prediction and
updating steps at each recursion. At the ith discrete time point, with the availability of the observed information
y_i, the equation for updating the mean estimate of the underlying state is

x̂_{i|i} = x̂_{i|i−1} + k_i [y_i − h′(x̂_{i|i−1}) x̂_{i|i−1} − w_i] = x̂_{i|i−1} + k_i [y_i − h(x̂_{i|i−1})]   (7)

where the Kalman gain function (Harvey, 1989) is

k_i = P_{i|i−1} h′(x̂_{i|i−1})^T [h′(x̂_{i|i−1}) P_{i|i−1} h′(x̂_{i|i−1})^T + R_i]^{−1}   (8)

For a semi-deterministic version of the EKF, the lack of variability is reflected in the prediction stage of each
recursion where, using the original transition expression given by equation (1), a one-step forecast of the mean
state vector is simply

x̂_{i+1|i} = f′(x̂_{i|i}) x̂_{i|i} + u_i = f(x̂_{i|i})   (9)

The covariance matrix for the state vector is also subjected to prediction and updating steps upon each recursion
of the algorithm. At the ith time point, the covariance matrix is updated using

P_{i|i} = P_{i|i−1} − P_{i|i−1} h′(x̂_{i|i−1})^T [h′(x̂_{i|i−1}) P_{i|i−1} h′(x̂_{i|i−1})^T + R_i]^{−1} h′(x̂_{i|i−1}) P_{i|i−1}

which can be written as

P_{i|i} = P_{i|i−1} − k_i h′(x̂_{i|i−1}) P_{i|i−1}   (10)

using the gain function given by equation (8). A one-step prediction of the covariance matrix is achieved using
the equation

P_{i+1|i} = f′(x̂_{i|i}) P_{i|i} f′(x̂_{i|i})^T   (11)

This concludes the description of the semi-deterministic EKF algorithm for general discrete time state-vector
and observation-vector processes.
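For concreteness, the recursion of equations (7)-(11) can be sketched in a few lines of code. This is an illustration for a scalar state only; the functions f and h, their derivatives, the noise variance and the observations below are all invented:

```python
import math

def ekf_update(x_pred, p_pred, y, h, h_prime, r):
    """Updating stage, equations (7), (8) and (10), for a scalar state."""
    hp = h_prime(x_pred)
    k = p_pred * hp / (hp * hp * p_pred + r)   # Kalman gain, equation (8)
    x_upd = x_pred + k * (y - h(x_pred))       # equation (7)
    p_upd = p_pred - k * hp * p_pred           # equation (10)
    return x_upd, p_upd

def ekf_predict(x_upd, p_upd, f, f_prime):
    """Prediction stage, equations (9) and (11): deterministic transition."""
    return f(x_upd), f_prime(x_upd) ** 2 * p_upd

# Hypothetical scalar system: state decays by 5% per step, observed through exp
f = lambda x: 0.95 * x
f_prime = lambda x: 0.95
h = lambda x: math.exp(x)
h_prime = lambda x: math.exp(x)

x, p = 1.0, 0.5                    # assumed initial estimates x0, P0
for y in (2.6, 2.3, 2.1):          # made-up observations
    x, p = ekf_update(x, p, y, h, h_prime, r=0.2)
    x, p = ekf_predict(x, p, f, f_prime)
print(x, p)
```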
The parameters of the algorithm are estimated in two stages. Firstly, the initial values x_0 and P_0 are
estimated from available data using maximum likelihood estimation (MLE). Secondly, considering n individual
observation histories, the parameters of the function h are estimated using MLE, the expectation-maximisation
(E-M) algorithm and the relationship

y_{ji} | x_{ji} ~ N(h(x_{ji}), R_{ji})   (12)

at the ith discrete time point in the jth history, where i = 1, 2, …, m_j and j = 1, 2, …, n.
The selection of an appropriate function h(x_i) is essential when designing a useful filter for a particular
scenario. We use the Akaike information criterion (AIC), which provides a means of comparing the maximum
likelihood values obtained for different candidate forms, and is derived on the assumption that the actual
underlying dynamics of the state and observation processes can be described by a given model if its parameters
are suitably adjusted, see Akaike (1974). The AIC statistic is

AIC = 2(k − log_e(L(θ)))   (13)

where L is the maximum likelihood value for the formulation and k is the number of parameters. The modelling
option producing the minimum AIC is the best choice for a given data set.
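A small sketch of the selection step (ours; the log-likelihood values anticipate table 1 in section 4, and the parameter counts, which include the noise variance R, are our reading of the fitted forms):

```python
def aic(k, log_likelihood):
    """Equation (13): AIC = 2(k - ln L); the smallest value wins."""
    return 2.0 * (k - log_likelihood)

# (parameter count, maximised log-likelihood) for each candidate h(x)
candidates = {
    "a + b*x":         (3, -64.924),
    "a + b/x":         (3, -68.529),
    "a + b*exp(-c*x)": (4, -62.418),
}
scores = {name: aic(k, ll) for name, (k, ll) in candidates.items()}
print(scores, "->", min(scores, key=scores.get))  # exp form: AIC 132.836
```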
3. Residual life prediction using vibration information
3.1 The system
In this section, we tailor the extended Kalman filtering methodology described in section 2 to a CM scenario
involving vibration monitoring and residual life estimation for individual components. At the ith vibration
monitoring point, at time t_i, we define a single underlying state, z_i, as the residual life that remains before the
failure of an individual component. At this time, we obtain information in the form of the overall vibration level
y_i, which is used to refine our estimate of z_i, and Y_i represents the vibration monitoring history {y_1, y_2, …, y_i}
available up to that point. If we assume that a given component has not failed before a specific monitoring point,
the residual life at that time can only take a positive value. As such, we choose to model the residual life as a
log-normal random variable and define the unknown underlying state of the EKF algorithm as x_i = log_e(z_i) at
the ith CM point. As discussed in the general specification (section 2), the updated and predictive conditional
densities for x_i are derived using the EKF algorithm and are described by x_{i|i} ~ N(x̂_{i|i}, P_{i|i}) and
x_{i+1|i} ~ N(x̂_{i+1|i}, P_{i+1|i}) respectively. Similarly, at the ith CM point, the conditional densities for the
underlying residual life are z_{i|i} ~ logN(x̂_{i|i}, P_{i|i}) and z_{i+1|i} ~ logN(x̂_{i+1|i}, P_{i+1|i}) respectively. Point
estimates of the residual life are obtained as ẑ_{i|i} = E[z_{i|i}] = E[z_i | Y_i] and ẑ_{i+1|i} = E[z_{i+1|i}] = E[z_{i+1} | Y_i],
with associated variances Λ_{i|i} and Λ_{i+1|i} respectively, where the conditional mean estimate and variance at
the ith CM point are

ẑ_{i|i} = exp(x̂_{i|i} + 0.5 P_{i|i})   (15)

Λ_{i|i} = (exp(P_{i|i}) − 1) exp(2 x̂_{i|i} + P_{i|i})   (16)

and analogous results apply for ẑ_{i+1|i} and Λ_{i+1|i}. Modelling the transition in the underlying state between two
successive monitoring points, we have z_{i+1} = z_i − (t_{i+1} − t_i) as the change in the residual life over the duration
between the ith and (i+1)th CM points, when z_i > t_{i+1} − t_i. From equation (2), the relationship between the
observed vibration information and the underlying residual life is described by the expression

y_i = h(x_i) + e_i   (17)

at the ith CM point, where e_i represents the observation noise.
3.2 The algorithm
As discussed in section 2, EKF and standard Kalman filtering algorithms are often presented as a two-step
process involving prediction and updating stages. The algorithm is initiated at the start of a new component's
operational life using x̂_{0|0} and P_{0|0}.
The prediction stage of the recursive algorithm forecasts the estimate of the state over the interval (t_i, t_{i+1}).
Assuming that the initial estimates ẑ_{0|0} and Λ_{0|0} are known, the one-step deterministic prediction of the
residual life at the ith CM point is

ẑ_{i+1|i} = ẑ_{i|i} − (t_{i+1} − t_i)   (18)

if ẑ_{i|i} > t_{i+1} − t_i, and ẑ_{i+1|i} → 0 otherwise. As the change in the state over the interval (t_i, t_{i+1}) is
deterministic, the variance about the mean estimate remains fixed, Λ_{i+1|i} = Λ_{i|i}. Reversing the relationship
given by equation (15), we transform ẑ_{i+1|i} into x̂_{i+1|i} for insertion into the next stage of the algorithm as

x̂_{i+1|i} = log_e(ẑ_{i+1|i}) − 0.5 log_e(1 + Λ_{i+1|i} / ẑ_{i+1|i}^2)   (19)

and again, without any random variation in the prediction of the state, we have no change in the variance over
the interval between CM points (t_i, t_{i+1}), so P_{i+1|i} = P_{i|i}.
The updating stage of the algorithm is undertaken upon obtaining the vibration information at the ith CM
point. The updating equation for the log of the residual life is

x̂_{i|i} = x̂_{i|i−1} + k_i (y_i − h(x̂_{i|i−1}))   (20)

where, from equation (8), the Kalman gain function is

k_i = P_{i|i−1} h′(x̂_{i|i−1}) / (h′(x̂_{i|i−1})^2 P_{i|i−1} + R_i)   (21)

and h′(x̂_{i|i−1}) = dh(x)/dx evaluated at x = x̂_{i|i−1}. The updating equation for the variance about x̂_{i|i} is

P_{i|i} = P_{i|i−1} − k_i h′(x̂_{i|i−1}) P_{i|i−1}   (22)

and to obtain the conditional mean and variance of the residual life, we utilise equations (15) and (16)
respectively.
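Assembled into code, one prediction/update cycle of this section reads as follows. This is a sketch, not the authors' implementation: the elapsed time between CM points and the vibration reading y are invented, while h and its parameters anticipate the case-study values of section 4:

```python
import math

def z_from_x(x, p):
    """Equations (15)-(16): conditional mean and variance of residual life."""
    z = math.exp(x + 0.5 * p)
    lam = (math.exp(p) - 1.0) * math.exp(2.0 * x + p)
    return z, lam

def x_from_z(z, lam):
    """Equation (19): recover the mean of the log residual life from the
    mean and variance of the residual life itself."""
    return math.log(z) - 0.5 * math.log(1.0 + lam / z ** 2)

def update(x_pred, p_pred, y, h, h_prime, r):
    """Equations (20)-(22): update on a new vibration level y."""
    hp = h_prime(x_pred)
    k = p_pred * hp / (hp * hp * p_pred + r)   # equation (21)
    return x_pred + k * (y - h(x_pred)), p_pred - k * hp * p_pred

# Case-study form h(x) = a + b exp(-c x) with the fitted values from table 1
a, b, c, r = 6.002, 383.641, 1.357, 13.327
h = lambda x: a + b * math.exp(-c * x)
h_prime = lambda x: -b * c * math.exp(-c * x)

x, p = 4.048, 0.223              # initial estimates from section 4
z, lam = z_from_x(x, p)          # about 64.0 and 1024.5
z_pred = max(z - 50.0, 1e-9)     # equation (18): 50 h pass (invented)
x_pred = x_from_z(z_pred, lam)   # equation (19); P is left unchanged
x, p = update(x_pred, p, y=12.0, h=h, h_prime=h_prime, r=r)  # y is invented
print(z_from_x(x, p))            # refreshed residual-life estimate
```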
3.3 Replacement decisions
For this particular application scenario, the main advantage of establishing a conditional probability distribution
over evaluating a single point estimate of the residual life is the availability of the cumulative distribution
function. The c.d.f. enables the construction of replacement decision models that incorporate the probability of
failure before a particular instant, conditioned on the CM history to date. At each monitoring point throughout
the life of an individual component, an optimal replacement time can be scheduled using renewal-reward theory
and the long-run 'expected cost per unit time', see Ross (1996). Initially, some further notation is defined: T_R is
the planned replacement time (to be optimised), C_P is the cost of a preventive replacement, and C_F is the
replacement cost associated with the failure of an individual component. At the ith CM point (time t_i) the
expected cost per unit time is given by

C(t_i, T_R) = E[cycle cost | T_R] / E[cycle length | T_R]   (23)

where C(t_i, T_R) is to be minimised with respect to T_R. As such, the replacement decision at the ith CM point is
obtained via the minimisation of

C(t_i, T_R) = [C_P + (C_F − C_P) P_i(z_i < T_R − t_i | Y_i)] / [u + t_i + (T_R − t_i)(1 − P_i(z_i < T_R − t_i | Y_i)) + ∫_0^{T_R − t_i} z p_i(z | Y_i) dz]   (24)

where u represents the age of the component prior to monitoring (for new components, u = 0), an upper-case P
represents a probability, a lower-case p denotes a conditional density, and the residual life at the ith CM point is
distributed as z_{i|i} ~ logN(x̂_{i|i}, P_{i|i}).
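A sketch of this minimisation (ours; the state estimates are illustrative, the costs are those quoted in section 4, and the integral in the denominator is evaluated by a simple midpoint rule):

```python
import math
from statistics import NormalDist

def expected_cost_rate(t_i, t_r, x_hat, p_var, u, c_p, c_f, n_grid=2000):
    """Equation (24): expected cost per unit time for a replacement planned at
    t_r, given log residual life ~ N(x_hat, p_var) at the CM point t_i."""
    horizon = t_r - t_i
    nd = NormalDist(mu=x_hat, sigma=math.sqrt(p_var))
    p_fail = nd.cdf(math.log(horizon))     # P(z_i < T_R - t_i | Y_i)
    # Integral of z p(z) dz over (0, horizon); for lognormal z,
    # z * pdf_lognormal(z) = pdf_normal(ln z), so integrate that directly.
    dz = horizon / n_grid
    trunc = sum(nd.pdf(math.log((j + 0.5) * dz)) * dz for j in range(n_grid))
    num = c_p + (c_f - c_p) * p_fail
    den = u + t_i + horizon * (1.0 - p_fail) + trunc
    return num / den

# Scan candidate replacement times; x_hat, p_var are illustrative values
costs = {tr: expected_cost_rate(t_i=0.0, t_r=float(tr), x_hat=4.048,
                                p_var=0.223, u=0.0, c_p=2000.0, c_f=6000.0)
         for tr in range(10, 200, 10)}
best = min(costs, key=costs.get)
print(best, costs[best])
```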
4. Case example
In this study, the EKF algorithm developed in the previous section is applied to a scenario involving the
estimation of residual life. The condition monitoring information consists of the overall vibration level recorded
at irregular time points during the operational lives of 5 components, with associated failure-time information.
Previous studies have shown that vibration monitoring scenarios can often be modelled effectively using a
two-stage process representing normal and defective operation, see Wang (2002). The vibration signal is usually
relatively stable during the first stage and begins to increase upon the occurrence of a defect. It is in the second
stage that techniques such as stochastic filtering are useful for residual life prediction. In this example, we are
only concerned with components operating in a defective state and assume that a fault detection technique has
been used to determine the start of the second stage, defined as time 0 in the following. Figure 1 illustrates the
vibration histories for the 5 components and the threshold level representing the start of the defective stage of
operation. The costs associated with failures and preventive component replacements are £6000 and £2000
respectively.
[Figure 1. The vibration monitoring histories: overall vibration level against time (hrs, 0-1000) for CM histories 1-5, with the threshold marking the start of the defective stage]
To initialise the EKF algorithm, we apply MLE using the failure time information from the 5 components in the
data set. The initial values are x̂_{0|0} = 4.048 and P_{0|0} = 0.223, and the corresponding initial estimate of residual
life is ẑ_{0|0} = 64.039 with associated variance Λ_{0|0} = 1024.5. We consider a number of candidate forms for the
function h in equation (17), representing the relationship between the vibration information and the log of the
residual life. MLE and the AIC criterion in equation (13) are used to parameterise and select between the
candidate forms. The results of the parameter estimation and model selection process are given in table 1. It is
evident from the results that the EKF modelling option with the functional form h(x_i) = a + b exp{−c x_i}
produces the minimum AIC and is the chosen function for this case study.
h(x_i):        a + b x_i    a + b / x_i    a + b exp{−c x_i}
a              39.138       0              6.002
b              −7.955       41.688         383.641
c                                          1.357
R              16.572       22.672         13.327
log(L(θ))      −64.924      −68.529        −62.418
AIC            135.848      143.058        132.836 *

Table 1. The estimated parameters and model selection results for candidate forms
of the vibration-based EKF algorithm
Figure 2 illustrates the conditional residual life distributions obtained at each monitoring point for component 4
using the vibration-based EKF algorithm. The black dots represent the actual underlying residual life at each
CM point. It is clear from the figure that the residual life distributions produced by the EKF algorithm are
appropriately distributed about the actual residual life for the component history considered and as we would
expect, accuracy improves as more CM information is obtained over time.
Figure 2. The conditional residual life distributions obtained at each CM point for component 4 using the EKF algorithm
As discussed in section 3.3, a replacement decision at the ith CM point can be evaluated using the age of the
component, the average costs of failures and replacements, and, crucially, the conditional distribution for the
residual life. As such, the form of the conditional density has a large impact when scheduling the replacement
time. We are considering an irregular monitoring process, and at the ith CM point the time of the next CM point
would be unknown. As we are not modelling a decreasing operational capability, the desirable property of the
replacement decision is to provide the maximum possible operational availability without the component failing,
which relies on the fit of the conditional density about the actual underlying residual life.
Continuing the illustration of the analysis for component 4, figure 3 plots the expected 'cost per unit time'
against the associated 'time until replacement', and demonstrates how the conditional distributions for the
residual life affect the associated replacement decisions at each CM point.
5. Discussion
In this paper, we have presented an approximate methodology for residual life prediction using the principles of
the well-known Kalman filter. The methodology is designed to enable simultaneous and efficient processing and
inference for a large number of components using CM information. The decision to use the EKF approach over
less approximate methodologies, such as non-linear probabilistic filtering or proportional hazards modelling,
will be application specific and will depend on the particular trade-off between efficiency and precision. The
trade-off will depend on the number of components that require simultaneous condition monitoring, the
processing power that is available, and the costs associated with preventive replacements and component
failures.
Figure 3. Illustrating the expected cost per unit time against potential replacement time, T_R, at each CM point for component 4
The results of the case study indicate that the EKF algorithm could be a useful approach for residual life
prediction in applications involving multi-component monitoring and limited computational resources. In an
extended version of this paper, currently in preparation, we include more information on the parameter
estimation and model selection processes. In addition, we discuss the inclusion of higher order terms in the
Taylor expansions of the system equations and provide a case comparison with some alternative techniques
used for residual life estimation.
Acknowledgement
The research documented in this paper has been supported by the Engineering and Physical Sciences Research
Council (EPSRC, UK) under grant EP/C54658X/1.
References
Akaike, H. (1974) A new look at the statistical model identification, IEEE Trans. Automatic Control, 19 (6), 716–723
Banjevic, D. and Jardine, A. K. S. (2004) Calculation of reliability function and remaining useful life for a Markov failure time process,
Proceedings of the 5th IMA conference on maintenance and reliability modelling, eds Wang, W., Scarf, P. and Newby, M., 5-7 April
2004, University of Salford, UK
Harvey, A. C. (1989) Forecasting, Structural Time Series Models and the Kalman Filter, Cambridge University Press
Makis, V. and Jardine, A. K. S. (1991) Optimal replacement policies in the proportional hazards model, INFOR, 30, 172-183
Ross, S. (1996) Stochastic Processes, 2nd edition, Wiley
Wang, W. (2002) A model to predict the residual life of rolling element bearings given monitored condition information to date, IMA J.
of Management Mathematics, 13, 3-16
Wang, W. and Christer, A. H. (2000) Towards a general condition based maintenance model for a stochastic dynamic system, J. of the
Operational Research Society, 51, 145-155
The utility of a maintenance policy
Rose Baker
Centre for Operational Research and Applied Statistics, University of Salford, UK
[email protected]
Abstract: In the author's concept of risk-averse maintenance, we seek to minimise the disutility of cost per unit
time rather than to minimise cost per unit time itself. This gives a maintenance policy that is optimal under
risk-aversion. The concept, introduced at the last conference in this series, is illustrated with a new example: the
maintenance of a standby system. A difficulty with the use of a utility function is that it is hard to elicit a
value for the risk-aversion parameter. The use of a plot of the certainty-equivalent cost of all feasible
maintenance policies against the risk-aversion parameter is suggested as a way round this difficulty.
Keywords: maintenance, inspection, utility function, risk aversion, standby system, Wald identity
1. Introduction
Economists have long used the concept of utility to model decision-makers' risk aversion. When making an
investment in the stock market, for example, a strategy that sought only to maximise expected gain would be
unduly risky. Seeking instead to maximise a concave utility function of gain leads to the classic Markowitz
scheme, in which a large portfolio of stocks is purchased.
Utility functions are ubiquitous in economic thought; rather oddly, however, in operational research cost or
cost per unit time is usually the criterion to be minimised, and risk aversion is simply ignored. The little
published work that is the exception occurs in warranty and inventory. Thus Padmanabhan & Rao (1993) and
Chun and Tang (1995) studied risk-averse warranty policies. In inventory, the classic newsvendor problem has
now been tackled from a utility-function viewpoint, e.g. Dohi et al. (1994), Eeckhoudt et al. (1995), Dionne &
Mounsif (1996) and Keren & Pliskin (2006).
Maintenance and reliability seems a suitable area in which to explore risk-averse policies, because there are
numerous cashflows occurring stochastically. What might be seen by some as over-maintenance, in the sense
that mean cost per unit time is not minimised, could be optimal as a risk-averse policy, in which the large
unscheduled losses from failure have such a disutility that very frequent maintenance is carried out.
Baker (2006) studied risk averse maintenance policies, and this paper extends that work in several
directions. A fresh example of the calculation of expected utility is given, for inspection of standby systems. The
use of utility functions other than the exponential Pratt-Arrow utility is discussed, and finally a graphical aid is
introduced that can help decision makers evaluate the relative merits of differing maintenance policies.
2. A utility-based approach
We first briefly recapitulate the key results of the earlier work. The exponential utility function of money y was
used, defined as

u = (1 − exp(−ηy)) / η,   (1)

where η > 0 is a measure of risk aversion. An expenditure x = −y has disutility

−u = (exp(ηx) − 1) / η,   (2)

and this form is used from now on. The certainty-equivalent cost D is the sum of money that, if definitely
gained or lost, would have the same expected utility as the variable cashflows of the policy. Hence if a policy is
carried out for time t, we have that

(exp(ηDt) − 1) / η = (E exp(ηX) − 1) / η,

or

D = ln{E exp(ηX)} / (ηt).   (3)
We consider failures of a system under some maintenance policy, where the system reaches a regeneration
point after a cycle of variable length. For example, at a regeneration point the whole system might be replaced.
Let the cost over the ith cycle be F_i. Then when cycles are of fixed length, the certainty-equivalent expenditure
per unit time over k cycles is

D = ln E{exp(η Σ_{i=1}^{k} F_i)} / (kητ)

where τ is the fixed cycle length. Since the cycles are independent and identically distributed,
E exp(η Σ F_i) = (E exp(ηF))^k, and hence

D = ln E exp(ηF) / (ητ),   (4)

dropping the cycle subscript i.
When cycle length is variable, equation 4 becomes

exp(ηDt) = E_N (E{exp(ηF)})^{N(t)},

where E_N denotes the expectation over the number of cycles N(t). Thus

ln E_N exp{ln(E exp(ηF)) N(t)} = ηDt.   (5)
To find D, we use the Wald identity from the theory of random walks. Given times S_n to the end of the
nth cycle, and with M(θ) as the moment generating function for cycle length, exp(θS_n)/M^n(θ) is a martingale
with unit expectation. The Wald identity

E[exp(θS_n) / M^n(θ)] = 1   (6)

follows, and the crucial step is to see that equation 6 is true under any stopping rule, and to choose to stop at a
very large time, the first regeneration after N(t) >> 1 regenerations. Clearly, the stopping time ~ t for large t.
Then with n written as the random variable N(t), equation 6 becomes E_N exp{θt − N(t) ln M(θ)} ~ 1, or

ln E_N exp{−ln M(θ) N(t)} ~ −θt.   (7)

Comparing this with equation 5, we see that it is identical if we make the choice θ = −ηD, when we must have
−ln M(−ηD) = ln E exp(ηF), or

M(−ηD) = 1 / E exp(ηF).   (8)

The value of D can be found by solving equation 8 by Newton-Raphson iteration, which only requires
differentiation of M. When cycle length is fixed at τ, M(−ηD) = exp(−ηDτ) and we regain equation 4. As
η → 0, writing the mean cycle length as l, M(−ηD) → 1 − ηDl and 1/E exp(ηF) → 1 − ηE(F), so that D → E(F)/l,
the mean cost per cycle divided by the mean cycle length.
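A sketch of that root-finding step follows (ours; the moment generating function, its derivative and the values in the sanity check are supplied by the caller and are hypothetical here):

```python
import math

def solve_certainty_equivalent(m_gf, m_gf_deriv, e_exp_eta_f, eta,
                               d0=1.0, tol=1e-10, max_iter=100):
    """Solve equation 8, M(-eta*D) = 1/E[exp(eta F)], for D by Newton-Raphson.
    m_gf(theta) is the cycle-length moment generating function and
    m_gf_deriv its derivative; only these two evaluations are needed."""
    target = 1.0 / e_exp_eta_f
    d = d0
    for _ in range(max_iter):
        g = m_gf(-eta * d) - target             # root sought: g(D) = 0
        g_prime = -eta * m_gf_deriv(-eta * d)   # chain rule
        step = g / g_prime
        d -= step
        if abs(step) < tol:
            break
    return d

# Sanity check with a fixed cycle length tau, where M(theta) = exp(theta*tau):
# equation 4 then gives D = ln(E exp(eta F)) / (eta * tau) exactly.
tau, eta, e_f = 2.0, 0.5, 1.8   # hypothetical values
d = solve_certainty_equivalent(lambda th: math.exp(th * tau),
                               lambda th: tau * math.exp(th * tau),
                               e_f, eta)
print(d, math.log(e_f) / (eta * tau))   # the two should agree
```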
3. Example: a standby system
To illustrate the use of this methodology, consider a standby system such as a lifevest, that ‘waits to work’.
There is a cost of inspection ci , and a cost of replacement or repair to ‘good as new’ of cr . We assume that
inspection and replacement take negligible time. Failure is undetected until next inspection, and there is a cost
per unit ‘downtime’. It is probably most realistic to consider that there is a probability pt of a disaster occurring during a downtime of length t, where pt ≪ 1. A large cost c_f is incurred only if a disaster occurs; the lifevest
is then needed and loss of life occurs because it is non-functional. A regeneration cycle ends with repair or
replacement.
A comprehensive reference to the non-statistical aspects of inspection models for standby systems is
chapter 8 of Nakagawa (2005), who derives an inspection policy for a standby electric generator, when repair
time is not negligible. The earliest work is by Barlow and Proschan (1965), who derived inspection times T_i to minimize the
total cost of checking and replacing a unit, plus the cost of undetected failure, taken as proportional to the
‘down’ time, over the life of the unit. Another important early reference is Keller (1973).
When η → 0 the risk-averse policy tends to the minimum cost per unit time policy, which we first derive: let the lifetime distribution of the item (e.g. lifevest) have survival function S(x), with mean \mu = \int_0^\infty S(u)\,du, and let inspections occur at intervals T. The probability that the cycle length is mT is S((m-1)T) - S(mT), and the mean cycle length is l = T \sum_{m=0}^{\infty} S(mT).
The mean cost per cycle is E(F) = c_i l/T + c_r + c_f p E(T_f), where the downtime T_f is the period from a failure at (m-1)T + u, after the (m-1)th inspection at (m-1)T, to the next inspection. Then

E(T_f) = \sum_{m=1}^{\infty} \int_0^T f((m-1)T + u)(T - u)\,du = T \sum_{m=0}^{\infty} S(mT) - \mu,

on integrating the pdf f of failure by parts. Hence

E(F)/l = p c_f + c_i/T + (c_r - \mu p c_f)/l.   (9)

Note that there will be an optimum value of T to minimise cost per unit time if c_r < \mu p c_f, so that the expected cost of failure with no maintenance exceeds the cost of replacing the item. A more intuitive derivation of this cost is given in Baker (2007).
Turning to the risk-averse solution, this parallels the previous derivation, but now we need the generating
function of cycle length and E exp(η F ) instead of E ( F ) . From the cycle length probabilities
S ((m − 1)T ) − S (mT ) , we have that
M(-s) = \exp(-sT) - (1 - \exp(-sT)) \sum_{m=1}^{\infty} S(mT)\exp(-msT).   (10)
Conditioning on a cycle length of m, the survival function becomes

\frac{S((m-1)T + u) - S(mT)}{S((m-1)T) - S(mT)}  for 0 ≤ u < T,

and the expected downtime is

E(T_f \mid m) = \frac{S((m-1)T)\,T - \int_0^T S((m-1)T + u)\,du}{S((m-1)T) - S(mT)}.

The expected cost of inspections is m c_i. We have

E\{\exp(\eta F) \mid m\} = \exp(\eta(c_r + m c_i))\,\{ 1 - p E(T_f \mid m) + p E(T_f \mid m)\exp(\eta c_f) \}.
Removing the conditioning on cycle length,

E\exp(\eta F) = \sum_{m=1}^{\infty} \exp(\eta(c_r + m c_i)) \Big\{ S((m-1)T) - S(mT) + (\exp(\eta c_f) - 1)\, p \Big( S((m-1)T)\,T - \int_0^T S((m-1)T + u)\,du \Big) \Big\}.
The certainty-equivalent cost D per unit time may now be found from equation 8.
Figure 1 shows the optimum value of the inspection period T as risk aversion η increases, for a standby
system where ci = 1, cr = 5, c f = 1000 and p = 0.01 . The failure time distribution is Weibull, so that the
survival function is S ( x) = exp(−(α x) β ) , and here α = 1, β = 2 .
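A numerical sketch of this calculation is given below. It is our own illustration, not the authors' code: it evaluates E exp(ηF) and the cycle-length mgf by truncated sums, solves equation 8 for D by bisection, and searches a grid of inspection periods T. The value eta = 0.005 is an assumed illustrative choice (so that exp(eta*cf) stays moderate), and the truncation and grid settings are arbitrary.

import math

# Standby-system example of the text: ci=1, cr=5, cf=1000, p=0.01,
# Weibull survival S(x) = exp(-(alpha*x)^beta) with alpha=1, beta=2.
ci, cr, cf, p = 1.0, 5.0, 1000.0, 0.01
alpha, beta, eta = 1.0, 2.0, 0.005   # eta is an assumed illustrative value

def S(x):
    return math.exp(-((alpha * x) ** beta))

def int_S(a, T, n=200):
    # trapezoidal approximation of int_0^T S(a + u) du
    h = T / n
    return h * (0.5 * S(a) + sum(S(a + i * h) for i in range(1, n)) + 0.5 * S(a + T))

def E_exp_etaF(T, mmax=200):
    # truncated sum over the cycle length m (see the expression above)
    total = 0.0
    for m in range(1, mmax + 1):
        a = (m - 1) * T
        if S(a) < 1e-15:
            break
        term = (S(a) - S(m * T)) + (math.exp(eta * cf) - 1.0) * p * (S(a) * T - int_S(a, T))
        total += math.exp(eta * (cr + m * ci)) * term
    return total

def M_minus(s, T, mmax=200):
    # M(-s), the cycle-length mgf at a negative argument (equation 10)
    tail = sum(S(m * T) * math.exp(-m * s * T) for m in range(1, mmax + 1))
    return math.exp(-s * T) - (1.0 - math.exp(-s * T)) * tail

def D_of_T(T):
    # bisection on s = eta*D for M(-s) = 1/E exp(eta*F) (equation 8)
    target = 1.0 / E_exp_etaF(T)
    lo, hi = 1e-12, 1.0
    while M_minus(hi, T) > target:
        hi *= 2.0
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if M_minus(mid, T) > target else (lo, mid)
    return 0.5 * (lo + hi) / eta

best_T = min((j / 100.0 for j in range(10, 200, 2)), key=D_of_T)
print(best_T, D_of_T(best_T))   # optimum inspection period and its cost rate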
Figure 1. How optimum inspection period T decreases as the risk aversion parameter η increases for maintenance of a
standby system.
Although the optimum policy that minimises cost per unit time has T = 0.994, figure 1 shows that the optimum inspection period decreases with increasing η in a way that looks exponential. The very high cost of a low-probability failure drastically changes the optimum inspection period.
Figure 2 illustrates the way that this methodology might be used by practitioners. The minimum cost per
unit time policy is preferable to a policy with shorter inspection interval when there is no risk aversion, but the
certainty-equivalent cost becomes greater with the longer inspection interval as risk aversion increases.
Figure 2. Certainty equivalent cost per unit time D as a function of risk aversion parameter η for two inspection policies
for maintenance of a standby system: T = 0.994 (the optimum policy at zero risk aversion) and T = 0.5.
Clearly, it is unrealistic to ask engineers or managers ‘what is the risk aversion parameter of your utility
function?’. There are gambling questions that can elicit the degree of risk aversion less baldly. For example,
consider a bet where one has probability p > 1/ 2 of winning some large amount X . For a given value of p ,
how much money would one be prepared to bet? The expectation of the exponential utility function is unchanged (one would be indifferent between betting and not betting) when

\exp(\eta X) = \frac{1 + (1 - 4p(1-p))^{1/2}}{2(1-p)};

since 1 - 4p(1-p) = (1-2p)^2, this reduces to \exp(\eta X) = p/(1-p) for p > 1/2, so an answer X to the gambling question yields \eta = \ln\{p/(1-p)\}/X. By considering ‘bets’ as closely related to the subject area as possible, an appropriate value of η for management could be imputed. However, determination of utility function parameters is difficult. A cynical colleague has commented that the trick is not to ask too many questions!

Instead of this, figure 2 could be plotted for several different maintenance policies, or for a proposed new policy and for an existing policy where the certainty-equivalent cost per unit time is estimated from cost data, without the necessity of building a model. A policy may be desirable in terms of mean cost per unit time, but it should also be robust in that D should not increase rapidly as the degree of risk aversion increases.
4. Conclusions
This article presents ongoing work in the arena of risk-averse maintenance. The aim here has been to present the
methodology for practitioners, by summarising the necessary mathematics, and by suggesting the use of a
graphical aid (figure 2) in choosing a maintenance policy. It may well be difficult to decide exactly what the
value of the risk aversion parameter η should be. Thought experiments in which for example gambling tasks
are contemplated can help here. However, the main suggestion here is that the certainty-equivalent cost of
various maintenance policies should be plotted against the degree of risk aversion. It can then be seen which
policies are not robust, in that their cost increases dramatically under moderate risk aversion.
References
Barlow, R. E. and Proschan, F. (1965) Mathematical Theory of Reliability, Wiley, New York
Baker, R. D. (2007) Inspection Policies for Reliability, Encyclopedia of Statistics in Quality and Reliability, Wiley
Baker, R. D. (2006) Risk aversion in maintenance: overmaintenance and the principal-agent problem, IMA J. of Management
Mathematics, 17 (2), 99 - 113
Baker, R. D. (2006) Risk aversion in maintenance, International J. of Polish Academy of Sciences ’Maintenance and Reliability’, 14-16
Chun, Y. H. and Tang, K. (1995) Determining the optimal warranty price based on the producer’s and customers’ risk preferences, European J. of Operational Research, 85 (1), 97-110
Dohi, T., Watanabe, A. and Osaki, S. (1994) Risk averse newsboy problem, RAIRO Recherche Opérationnelle - Operations Research, 28 (2), 181-202
Dionne, G. and Mounsif, T. (1996) Investment under demand uncertainty: the newsboy problem revisited, Geneva papers on risk and
insurance theory, 21 (2), 179-189
Eeckhoudt, L., Gollier C. and Schlesinger H. (1995) The risk-averse (and prudent) newsboy, Management Science, 41 (5), 786-794
Keller, J. B. (1973) Optimum checking schedules for systems subject to random failure, Management Science, 21, 256-260
Keren, B. and Pliskin, J. S. (2006) A benchmark solution for the risk-averse newsvendor problem, European J. of Operational Research,
174, 1643-1650
Nakagawa, T. (2005) Maintenance Theory of Reliability, Springer, London, 201-229
Padmanabhan, V. and Rao, R. C. (1993) Warranty policy and extended service contracts: theory and an application to automobiles,
Marketing Science, 12 (3), 230-247
Stochastic demand patterns for Markov service facilities with neutral and active
periods¹
Attila Csenki
School of Computing and Mathematics, University of Bradford, Bradford BD7 1DP, UK
[email protected]
¹ This conference paper is a shortened version of a paper currently under review with a journal.
Abstract: In an earlier paper, Csenki (2007), a closed form expression was obtained for the joint interval
reliability of a Markov system with a partitioned state space S = U ∪ D , i.e. for the probability that the system
will reside in the set of up states U throughout the union of some specific disjoint time intervals. These
deterministic time intervals formed a demand pattern specifying the desired active periods. In the present paper,
we admit stochastic demand patterns by assuming that the lengths of the time intervals, that is the active
periods, as well as that of the neutral periods, are random. We explore two mechanisms for modelling random
demand: (1) by alternating renewal processes; (2) by sojourn times of some continuous time Markov chain with
a partitioned state space. The first construction results in an expression in terms of a revised version of the
moment generating functions of the sojourns of the alternating renewal process. The second construction
involves the probability that a Markov chain follows certain patterns of visits to some groups of states and
yields an expression using Kronecker matrix operations. The model of a small computer system is considered to
exemplify the ideas.
1. Introduction
In an earlier paper, Csenki (2007), a closed form expression was introduced for the joint interval reliability of
systems modelled by a finite state Markov process X. The system's state space was partitioned into up and down
states, S = U ∪ D , and of concern in Csenki (2007) was the probability that X will reside in the set of up
states U throughout the union of a given disjoint set of k time intervals I λ = [θ λ ,θ λ + ς λ ] , λ = 1,..., k . The
time intervals were termed a demand pattern specifying the system's active periods and they were taken to
model customer expectations. X was termed the supply process. This paper is a continuation of work reported
in Csenki (2007), building upon the result therein. It stems from the recognition that in most cases the demand
pattern is best described by a stochastic rather than a deterministic process.
The systems modelled by the present paper are Markovian service facilities subject to random demand
where the result of the service cannot be stored and is lost if not consumed immediately. Such systems are, for
example, wind farms or resources in organizations where resource unavailability results in immediate loss of
custom. Queueing or storage is not possible. The `commodity' produced is continuous: it is simply the
instantaneous uptime or availability of the facility.
The Markovian supply is described in Sect. 2.1. The demand patterns comprise two kinds of random
intervals interlaced: neutral and active periods. We introduce in Sect. 2.2 two alternative modes of constructing
demand patterns: Scheme 1 is based on alternating renewal processes, and Scheme 2 is based on the sojourn
times of a finite irreducible Markov chain alternating between some subsets of states N (`neutral') and A
(`active'). (The latter construction is built, in a sense, `to match' the supply process.) In Sect. 2.3, we define the
quality of service to be the measure of compliance C (k ) as the probability that the facility will meet demand in
the first k active periods. In Sect. 3, we give closed form expressions for C (k ) . For Scheme 1, the result will
be represented in terms of the revised moment generating function. For Scheme 2, C (k ) is an expression
involving Kronecker products of various submatrices of the rate matrices of both Markov processes modelling
supply and demand. Several special cases will be elaborated upon in the presentation. In Sect. 4, the theory is
applied to analyze the Markov model of a small computer system from Muppala et al. (1996).
2. Model construction
2.1. The supply process
The supply is modelled by an irreducible Markov chain X = { X (t ), t ≥ 0} whose finite state space S = U ∪ D
is partitioned into the set of up states U and its complement D , the set of down states. The system's transition
rate matrix \Lambda is partitioned thus:

\Lambda = \begin{pmatrix} \Lambda_{UU} & \Lambda_{UD} \\ \Lambda_{DU} & \Lambda_{DD} \end{pmatrix}.   (1)

X is started according to some probability vector \alpha at time zero; this is a row vector of length |S|.
2.2. The demand process
2.2.1. Modelling customer requirements: neutral and active periods
In Csenki (2007) we have introduced the notion of a demand pattern as a finite sequence of finite time intervals
during each of which the facility is expected by the customer to be in a definite subset of the state space S . The
intervals were assumed deterministic in Csenki (2007). Now we examine the case when the demand pattern
comprises random time intervals. Random demand patterns are formed by interlacing neutral and active periods:
we start with a neutral period of random length ξ_1, followed by an active period of random length ς_1, followed by a neutral period of random duration ξ_2, etc. During a neutral period there is no expectation concerning the facility from the customer side; the supply process can be anywhere in S. During an active period the supply process is hoped by the customer to be in the set of up states U. The lengths of the neutral and active periods are respectively denoted by ξ_1, ξ_2, … and ς_1, ς_2, … . These random variables form the demand process W and they are assumed independent of the supply process X.
We consider three modelling alternatives.
2.2.2. Scheme 0: deterministic demand
This case was considered earlier (Csenki (2007)): the interval lengths ξ_i and ς_i are deterministic and so are the 2k interval endpoints:

\theta_\lambda = \sum_{i=1}^{\lambda-1} (\xi_i + \varsigma_i) + \xi_\lambda,  \omega_\lambda = \sum_{i=1}^{\lambda} (\xi_i + \varsigma_i),  \lambda = 1, \dots, k.   (2)
2.2.3. Scheme 1: alternating renewal process
W may be assumed an alternating renewal process for modelling periodic customer requirements comprising
stochastically identical independent cycles. The intervals’ durations are the independent sequences
ξ = (ξ1 , ξ 2 ,...) and ς = (ς 1 , ς 2 ,...) , each comprising independent and identically distributed random variables.
2.2.4. Scheme 2: Markov sojourn times
To allow for the possibility of dependence between cycle lengths, here it is assumed that the random variables
ξ1 , ξ 2 ,... and ς 1 , ς 2 ,... are the respective sojourn times of an irreducible Markov process Y = {Y (t ), t ≥ 0} in
the disjoint subsets N , A ⊂ T , where T = N ∪ A is the finite, partitioned state space of Y . The transition rate
matrix of Y is denoted by \Psi; it is in the partitioned form

\Psi = \begin{pmatrix} \Psi_{NN} & \Psi_{NA} \\ \Psi_{AN} & \Psi_{AA} \end{pmatrix}.   (3)
Y alternates between N and A indefinitely, thereby generating the sequences of neutral and active periods. Y
is started at time zero in N according to some probability (row) vector β N . The overall initial probability
vector of Y is therefore
\beta = (\beta_N, 0, \dots, 0) = (\beta_N, \mathbf{0}_A).   (4)
The sojourn times of Y in A define the active periods; they are the time intervals I λ = [θ λ , ωλ ] .
We add in passing that the random variables forming the sequences ξ and ς are phase type distributed,
and, in general, dependent. If the subsets N and A are entered into by Y always through the same respective
states, ξ and ς define an alternating renewal process of phase type distributed random variables. This is then a
special case of Scheme 1 from Sect. 2.2.3. For recent applications of phase type distributions in the reliability
setting, refer to Montoro-Cazorla & Perez-Ocon (2006), Perez-Ocon & Montoro-Cazorla (2004a,b) and Perez-Ocon & Ruiz Castro (2004).
2.3. The compliance measure C (k )
We will evaluate in Sect. 3 expressions for the probability of supply meeting demand during the first k active
periods, i.e. for
C(k) = P\Big( X(t) \in U \text{ for all } t \in \bigcup_{i=1}^{k} [\theta_i, \omega_i] \Big).   (5)
C (k ) is interpreted as a measure of the supply complying with demand.
3. Model analysis
3.1. Scheme 0
In this case, C (k ) is the joint interval reliability over the deterministic intervals [θ i ,θ i + ς i ] , i = 1,..., k . From
our earlier work we know that this is
C(k) = \alpha \exp(\xi_1 \Lambda)\, I_{SU} \exp(\varsigma_1 \Lambda_{UU}) \prod_{\lambda=2}^{k} \{ I_{US} \exp(\xi_\lambda \Lambda)\, I_{SU} \exp(\varsigma_\lambda \Lambda_{UU}) \}\, \mathbf{1}_U.   (6)
3.2. Scheme 1
Because of the independence of supply and demand and because of the independence of the random durations
constituting the demand process itself, (6) can be integrated termwise to give
C(k) = \alpha M_\xi(\Lambda)\, I_{SU} M_\varsigma(\Lambda_{UU}) \{ I_{US} M_\xi(\Lambda)\, I_{SU} M_\varsigma(\Lambda_{UU}) \}^{k-1} \mathbf{1}_U,   (7)

where

M_\kappa(\mathbf{Z}) = E(\exp(\kappa \mathbf{Z}))   (8)
stands for the revised moment generating function (mgf) of the random variable κ . In contrast to the usual
practice, in (8) the matrix exponential is used and Z , the argument of M κ , is a square matrix.
The special cases where the demand periods are Gaussian, shifted Erlang or two-phase Coxian will be
specifically addressed in the presentation as they will be used in the application.
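Equation (7) is straightforward to evaluate once the revised mgfs are available. The sketch below is our own illustration, not code from the paper: it estimates M_κ(Z) by Monte Carlo using scipy's matrix exponential, and the two-state supply chain, its rates and the Gaussian durations are hypothetical stand-ins echoing Sect. 4.2.

import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(1)

def revised_mgf(samples, Z):
    """Monte Carlo estimate of M_kappa(Z) = E[exp(kappa*Z)] (equation 8),
    where exp is the matrix exponential and kappa is a random duration."""
    return sum(expm(s * Z) for s in samples) / len(samples)

def compliance_scheme1(alpha, Lam, U, k, xi_samples, sig_samples):
    """C(k) from equation (7) for an alternating renewal demand process."""
    n = Lam.shape[0]
    I_SU = np.eye(n)[:, U]                      # restriction matrix S -> U
    I_US = np.eye(n)[U, :]                      # embedding matrix U -> S
    M_xi = revised_mgf(xi_samples, Lam)                  # M_xi(Lambda)
    M_sig = revised_mgf(sig_samples, Lam[np.ix_(U, U)])  # M_sigma(Lambda_UU)
    cycle = I_US @ M_xi @ I_SU @ M_sig
    v = alpha @ M_xi @ I_SU @ M_sig @ np.linalg.matrix_power(cycle, k - 1)
    return float(v @ np.ones(len(U)))

# Hypothetical supply chain: state 0 up, state 1 down (failure rate 0.1, repair rate 1.0).
Lam = np.array([[-0.1, 0.1], [1.0, -1.0]])
alpha, U = np.array([1.0, 0.0]), [0]
# Gaussian neutral/active durations as in Scheme 1 (G): means 8 and 16 hr (effectively positive).
xi = rng.normal(8.0, 1.0, 2000)
sig = rng.normal(16.0, 2.0 ** 0.5, 2000)
print(compliance_scheme1(alpha, Lam, U, k=3, xi_samples=xi, sig_samples=sig))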
3.3. Scheme 2
3.3.1. Computation of C(k)
The combined, bivariate process Z (t ) = ( X (t ), Y (t )) models the interaction between supply and demand. It has
state space S × T , the Cartesian product of the individual state spaces. And, because of the independence of the
component processes, Z is a Markov process with transition rate matrix
\Gamma = \Lambda \oplus \Psi = \Lambda \otimes I_{TT} + I_{SS} \otimes \Psi,
where ⊕ and ⊗ stand, respectively, for the matrix operations Kronecker sum and Kronecker product (Graham
(1981), Keller & Qamber (1988)).
From the many results for these operations, we need the ones for representing blocks of matrices thus
generated: for any two square matrices Λ and Ψ of respective size S × S and T × T , it is for S1 , S 2 ⊂ S
and T1 , T2 ⊂ T ,
(\Lambda \otimes \Psi)_{S_1 \times T_1,\, S_2 \times T_2} = \Lambda_{S_1 S_2} \otimes \Psi_{T_1 T_2},   (9)

(\Lambda \oplus \Psi)_{S_1 \times T_1,\, S_2 \times T_2} = \Lambda_{S_1 S_2} \otimes I_{T_1 T_2} + I_{S_1 S_2} \otimes \Psi_{T_1 T_2}.   (10)
The initial probability vector of Z is \alpha \otimes \beta; this is a row vector of length |S|·|T|. By (4), the A-entries of the initial probability vector of Z are zero.
In the presentation we shall indicate that the quantity of interest, C(k), is expressible in terms of an event involving Z. Then, after some work, the following formula is obtained for C(k):

C(k) = -(\alpha_S \otimes \beta_N)(\Gamma_{EE}^{-1}\Gamma_{EF}\Gamma_{FF}^{-1}\Gamma_{FE})^{k-1}\,\Gamma_{EE}^{-1}\Gamma_{EF}\,(\mathbf{1}_U \otimes \mathbf{1}_A),   (11)

where

\Gamma_{EE} = \Lambda_{SS} \otimes I_{NN} + I_{SS} \otimes \Psi_{NN},   (12)
\Gamma_{EF} = I_{SU} \otimes \Psi_{NA},   (13)
\Gamma_{FE} = I_{US} \otimes \Psi_{AN},   (14)
\Gamma_{FF} = \Lambda_{UU} \otimes I_{AA} + I_{UU} \otimes \Psi_{AA},   (15)
\Gamma_{FG} = I_{UD} \otimes \Psi_{AA}.   (16)
The assumed initial condition (4) may appear restrictive in that demand patterns starting with an active
period are disallowed. It will be indicated in the presentation that this restriction on Y can be overcome.
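For completeness, a minimal numerical sketch of equation (11) is given below. It is our own illustration, not the author's implementation: the blocks (12)-(15) are assembled with np.kron, and the two-state supply and demand chains at the bottom are hypothetical, with mean neutral and active sojourns of 8 and 16 hours.

import numpy as np

def compliance_scheme2(alpha, beta_N, Lam, Psi, U, N, A, k):
    """C(k) from equation (11); U: up states of the supply chain X,
    N, A: neutral/active states of the demand chain Y."""
    nS = Lam.shape[0]
    IS = np.eye(nS)
    G_EE = np.kron(Lam, np.eye(len(N))) + np.kron(IS, Psi[np.ix_(N, N)])
    G_EF = np.kron(IS[:, U], Psi[np.ix_(N, A)])          # I_SU (x) Psi_NA
    G_FE = np.kron(IS[U, :], Psi[np.ix_(A, N)])          # I_US (x) Psi_AN
    G_FF = np.kron(Lam[np.ix_(U, U)], np.eye(len(A))) + \
           np.kron(np.eye(len(U)), Psi[np.ix_(A, A)])
    inv = np.linalg.inv
    cycle = inv(G_EE) @ G_EF @ inv(G_FF) @ G_FE          # one demand cycle, E -> E
    v = np.kron(alpha, beta_N) @ np.linalg.matrix_power(cycle, k - 1) @ inv(G_EE) @ G_EF
    return -float(v @ np.ones(len(U) * len(A)))

# Hypothetical example: 2-state supply (up/down) and 2-state demand (neutral/active).
Lam = np.array([[-0.1, 0.1], [1.0, -1.0]])
Psi = np.array([[-0.125, 0.125], [0.0625, -0.0625]])   # mean neutral 8 hr, active 16 hr
print(compliance_scheme2(np.array([1.0, 0.0]), np.array([1.0]),
                         Lam, Psi, U=[0], N=[0], A=[1], k=3))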
Two examples for demand patterns of the type discussed here will be spelt out in more detail in the
presentation: we call them rest-then-work and work-then-rest as they evoke these respective behavioural
patterns.
4. Application: a small computer system
4.1. The Markov system
The Markov model of a computer system comprising three workstations connected to a file server via a computer
network will be considered here; the model (in its basic form) stems from Muppala et al. (1996). System
components fail independently of each other. Each of the workstations fails at a rate of ρ w , whereas the failure
rate of the file server is ρ f . The computer network is very reliable and it may be assumed that it never fails.
There is a single repairman to look after the system. The repair rate of a workstation is µ w . Repair of a failed
workstation commences immediately upon its failure if the repairman is not busy. In the latter case, he starts
repairing the waiting workstation as soon as he has finished repairing the other item. It is assumed that the
system is functional if at least one of the workstations and the file server are in the up state. Furthermore, a component cannot fail while the system is down.
The states of the modelling process X will be denoted by pairs (i, j ) ; i = 0,1,2,3 refers to the number of
workstations in the up state, j indicates the state of the file server: zero for ‘down’, one for ‘up’. As
components cannot fail during system down time, the state (0,0) is not reachable from any of the other states; it
will therefore be ignored.
Repair of the file server enjoys priority. This policy becomes operational if the file server fails while one of the workstations undergoes repair and at least one of them is still functional. (This corresponds to the
transition (i,1) → (i,0) , i ∈ {1,2} .) Then, the ongoing repair of the workstation is abandoned and resumed later
(from scratch) once the file server has been repaired.
The system's transition rate diagram is shown in Fig. 1. (Down states are indicated by dashed lines.) Notice that
failure rates (denoted by ρ ) are associated with transitions where one of the entries in the pair is decremented.
Repair rates (denoted by µ ) are associated with transitions where one of the entries in the pair is incremented.
The system’s transition rate matrix Λ is easily constructed from Fig. 1. Initially, all components are up, i.e.
α = (1,0,...,0) . The parameters’ numerical values are adopted from Muppala et al. (1996).
4.2. Demand patterns
We examine the demand process {(ξ1 , ς 1 ), (ξ 2 , ς 2 ),...} under various distributional assumptions. For reasons of
comparison, the respective mean periods will be the same under all schemes. Two scenarios will be considered:
1. The interval lengths are random but ‘almost’ deterministic and therefore duration variances are small. The supplier can reasonably ‘guess’ when and for how long the service will be in demand. The Schemes 0, 1 (G) and 1 (E) (see below) will be used to model this situation.
2. There is more uncertainty here about the durations of neutral and active periods. Schemes 1 (C) and 2
will be used to model this situation. Scheme 2 will allow also dependence of active and neutral periods
to be modelled.
Figure 1. Markov model of the computer system
Scheme 0. Neutral and active periods are deterministic and of lengths 8 and 16 hours, respectively.
Expression (6) applies.
Scheme 1. Neutral and active periods are interlaced to form an alternating renewal process. Expression (7)
applies. We consider three cases.
(G) (Gaussians) Neutral and active durations are normally distributed with respective means a and 2a. Their variances, denoted by σ² and τ², may be set, for example, to the values 1 hr² and 2 hr², respectively, so as to model small uncertainties in the periods’ durations. The parameter a will be set to 8 hr.
(E) (Shifted Erlangs) Neutral and active durations are shifted Erlang distributed with respective sets of parameters: n_i exponential stages, each with rate µ_i, and the resulting Erlang distribution then shifted by a_i, i = 1, 2. Choose the parameters such that means and variances match those in Scheme 1 (G); for example,

E(\xi_i) = a_1 + n_1/\mu_1 = 8 hr,  Var(\xi_i) = n_1/\mu_1^2 = 1 hr²,   (17)
E(\varsigma_i) = a_2 + n_2/\mu_2 = 16 hr,  Var(\varsigma_i) = n_2/\mu_2^2 = 2 hr².   (18)

A set of parameters conforming to (17) and (18), respectively, is

a_1 = 6 hr,  n_1 = 4,  \mu_1 = 2/hr,   (19)
a_2 = 14 hr,  n_2 = 2,  \mu_2 = 1/hr.   (20)
(C) (Two-phase Coxians) Neutral and active durations are two-phase Coxian distributed. They model large random variations in the intervals’ durations.
Scheme 2. The rest-then-work demand pattern will be discussed here.
4.3. Implementation and numerical results
In the conference presentation unreliabilities will be used: these are the complementary probabilities 1 − C (k ) .
The numerical results will be shown and discussed there. They were obtained by SCILAB (Campbell et al.
(2006), Pincon (2003)), a public domain software for numerical work with emphasis on matrix representations.
5. Conclusions
We have developed formulae for the probability that a system modelled by a Markov process with a partitioned
state space will be in the set of up states throughout the first k active periods where these are stochastic and are
generated either by an alternating renewal process, or, are sojourn times of another Markov process with a
partitioned state space. Potential future developments will be discussed in the presentation.
Acknowledgements
I thank Dr. David Jerwood, Head of Mathematics at the University of Bradford, for the financial assistance
which allowed me to attend the IMA conference MIMAR 2007 and to present this paper there.
References
Campbell, S. L., Chancelier, J.-Ph. and Nikoukhah, R. (2006) Modeling and Simulation in Scilab/Scicos, Springer, Heidelberg, New
York
Csenki, A. (2007) Joint interval reliability for Markov systems with an application in transmission line reliability, Reliability Engineering
and System Safety, 92, 685-696
Graham, A. (1981) Kronecker products and matrix calculus with applications, Ellis Horwood/Wiley, Chichester, New York
Keller, A. Z. and Qamber, I. S. (1988) System Availability Synthesis. In: Tenth Advances in Reliability Technology Symposium,
Proceedings of the 10th Advances in Reliability Technology Symposium, University of Bradford, 6-8 April 1988, (ed. Libberton,
G. P.), 173-188. Elsevier Applied Science, London, New York
Montoro-Cazorla, D. and Perez-Ocon, R. (2006) A deteriorating two-system with two repair modes and sojourn times phase-type
distributed, Reliability Engineering and System Safety, 91, 1-9
Muppala, J. K., Malhotra, M. and Trivedi, K. S. (1996) Markov dependability models of complex systems: analysis techniques, In NATO
Advanced Science Institute Series, Series F: Computer and System Sciences 154, Proceedings of the NATO Advanced Study
Institute on Current Issues and Challenges in the Reliability and Maintenance of Complex Systems, Kemer-Antalya, Turkey, 12-22
June, 1995 (ed. S. Ozekici), Springer, Berlin, Heidelberg, 442-486
Perez-Ocon, R. and Montoro-Cazorla, D. (2004a) A multiple system governed by a quasi-birth-and-death process, Reliability
Engineering and System Safety 84, 187-196
Perez-Ocon, R. and Montoro-Cazorla, D. (2004b) Transient analysis of a repairable system, using phase-type distributions and geometric
processes, IEEE Transactions on Reliability 53, 185-192
Perez-Ocon, R. and Ruiz Castro, J. E. (2004) Two models for a repairable two-system with phase-type sojourn time distributions,
Reliability Engineering and System Safety 84, 253-260
Pincon, B. (2003) Eine Einführung in Scilab, (translated from the French by Jarausch, H.), Institut Élie Cartan Nancy, Université Henri Poincaré, France, 2003. http://www.scilab.org/publications/JARAUSCH/PinconD.pdf
Multicriteria decision model for selecting maintenance contracts by applying
utility theory and variable interdependent parameters
Anderson Jorge de Melo Brito, Adiel Teixeira de Almeida
Federal University of Pernambuco, Cx. Postal 7462, Recife – PE, 50.630-970, Brazil
[email protected], [email protected]
Abstract: Contract selection is a very important stage in the process of maintenance outsourcing, given the current trend towards reducing cost and increasing competitiveness by focusing on core competences. The prominence of this theme can be seen in many studies carried out on outsourcing and maintenance contracts, most of which deal with qualitative aspects. However, quantitative approaches, such as Multicriteria
Decision Aid, play an important role in helping decision makers to deal with multiple and conflicting criteria
and uncertainties in the selection process for outsourcing contracts. In this context, several decision models have
been developed, using Utility Theory, ELECTRE and other multicriteria methods. This paper presents a
multicriteria methodology to support the selection of maintenance contracts in a context where information is
imprecise, when decision makers are not able to assign precise values to importance parameters of criteria used
for contract selection. Utility Theory is combined with the Variable Interdependent Parameters Method to
evaluate alternatives through an additive value function regarding interruption time, contract cost and
candidate’s dependability. To illustrate the use of the model, a numerical application with VIP Analysis software
is presented.
1. Introduction
Following the wider trend of outsourcing non-core-competences as a strategic policy, several companies are
nowadays establishing contracts with external firms to perform repair and maintenance services on their systems.
The main objective of maintenance outsourcing is, in general, to increase the availability of such systems
through a better maintainability structure offered by specialized staff, at costs lower than those related to the use
of in-house professionals. Therefore, the process of selecting maintenance service outsourcing, as well as
supplier selection, often presents a conflict between assessment criteria, and a conflicting and uncertain
performance of contract alternatives on these criteria. These factors characterize the selection of maintenance and repair contracts as a multiple criteria decision-making process (Weber et al. (1991), Huang & Keskar (2007)).
However, as Almeida (2005) pointed out, most studies found in the literature approach this theme, as well as the supplier selection problem, using qualitative aspects (see, for example, Kennedy (1993)). Few papers
have analysed these problems by exploring a quantitative approach with Multicriteria Decision Aid (Keeney &
Raiffa (1976), Vincke (1992)). For choice of supply contracts, Ghodsypour & O’Brien (1998) have presented an
Analytic Hierarchy Process and Linear Programming-based model in order to consider qualitative and
quantitative factors in supplier selection. De Boer et al. (1998) addressed the use of outranking multicriteria
methods for such kind of problems, and discussed the advantages of multicriteria methods in relation to
traditional decision models. Huang & Keskar (2007) have presented a set of configurable metrics to assess
supplier performance and to help decision-making methodologies for choice of suppliers.
Regarding maintenance contract selection, little work has been conducted exploring a multicriteria decision
making approach. Almeida (2001) has presented multicriteria decision models based on Multiattribute Utility
Theory (Keeney & Raiffa, 1976) for selecting repair contracts, which aggregate interruption time and related
cost through an additive utility function. A different approach can be found in Almeida (2002), where the
ELECTRE I method has been combined with utility functions regarding a repair contract problem.
In this paper, we consider the use of a multicriteria decision approach through an additive value function to
help decision makers find the most preferred alternative in selecting a maintenance contract. Additive value
functions are a widespread and well-known approach in problems concerning ranking and choice, according to
multiple criteria or attributes (Dias & Clímaco (2000)). However, when the set of criteria to assess available
alternatives is established, it can be seen that decision makers may not only find it difficult to provide precise
information about their preferences, but these preferences may change at any point during the decision making
process. In many cases, the elicitation procedures of parameters of criteria importance may take more time and
require more patience than decision makers are willing to provide (Dias & Clímaco (2000)).
This paper presents a multicriteria model to support decision making in selecting maintenance and repair
contracts under the situation mentioned. Utility functions, to reflect decision makers’ preferences and their
behavior regarding risk, are combined in an additive value function with imprecise information on the scaling
parameters of criteria. To illustrate the model, a numerical application with VIP Analysis, a decision support tool
developed for a variable interdependent parameters approach (Dias & Clímaco (2000)), is performed.
2. The problem analysed
The problem under study deals with selecting outsourced maintenance contracts for a repairable system. It is widely recognized that the contract cost must no longer be taken as the only aspect to guide decisions on
contract selection. In industrial and mainly in service systems, credibility, delivery time, customer satisfaction,
quality and other non-monetary aspects are often affected by system availability, and they play an important role
in a company’s competitiveness. Hence, companies must be concerned with all aspects which may influence the
availability of their system, so contract selection must take into consideration multiple and often conflicting
objectives.
In this process, the decision maker (DM) faces several options for maintenance contracts, each implying
different system performances and related costs. The DM has to choose the option most preferred, the one with
the best combination of contract conditions (Almeida (2002)). These conditions may vary depending on the
company’s market and strategy, and may involve: delivery speed or response time, quality, flexibility,
dependability and obviously, cost (Slack & Lewis (2002), Almeida (2005)).
The set of actions, represented here as A, corresponds to the set of all maintenance contract alternatives
available to the decision maker. Assuming that A is composed of n alternatives (n>1), then A is a discrete set
represented by A={a1, a2, a3,…, an}, where ai is any contract option in A. Each alternative ai presents a contract
cost and other performances on the set of criteria considered.
Almeida (2001, 2002) has modeled this problem by considering response time and contract cost, and using
different methods and different probabilistic assumptions for response time. The decision model proposed here
includes three basic criteria: interruption time, applicant’s dependability considering a contract option and
contract cost. It is assumed here that these three criteria are satisfactorily comprehensive to allow the decision
maker to assess alternatives and quantify the consequences of choosing any action ai in A.
The interruption time, here represented by T_I, corresponds to the interval when the system is not working due to repair or other maintenance activities. It is related to system availability. Since reliability is an inherent design feature of a company’s equipment, it is assumed to be the same for each alternative a_i. So the alternative’s maintainability structure, represented by T_I, is the discriminating factor for availability. T_I is
related to the speed of a repair facility for each alternative ai (Almeida (2005)). It is influenced by staff training,
the applicant’s repair facilities and spares provisioning. In this paper, it is assumed that administrative delay
time TD (Almeida (2001)) is negligible in relation to time to repair TTR, so we consider TI = TTR. For each
contract alternative ai ,TI is assumed to be a random variable. The uncertainty related to TI will be incorporated
through a probability density function fi(ti ) for each contract alternative ai.
The dependability criterion is used to assess contract alternatives in relation to “deadlines” being met. It is a
measure related to keeping delivery promises (Slack & Lewis (2002), Almeida (2005)). Assuming that T_I = TTR, dependability will be represented by the probability d_i of achievement of time to repair under a specified
probability distribution, as undertaken in the contract proposal related to ai (Almeida (2005)).
A contract cost ci, presented by each applicant, represents the remuneration of the maintenance structure
that will be available to the contracting company. The cost ci is appraised for a period of time during which
repair activities are under warranty of being performed according to contract conditions associated with TI.
Therefore, each contract alternative ai may have its performance represented by a multi-dimension vector
comprising: a parameter (or parameters) related to the probability density function of the interruption time ti; a
probability di of achievement of interruption time as undertaken in the contract proposal; and a contract cost ci .
For each alternative ai, ci is assumed to be a fixed value, which is the cost of the contract proposal i.
The three criteria presented in this model may conflict among contract alternatives. Usually, lower
interruption times (in this model, times to repair) are related to better resource conditions, better spares
provisioning and higher professional skills, and they often imply higher costs. Besides, the dependability of the
alternative is not directly related to the proposal conditions associated with interruption time, but it is assessed
by the contracting company taking into consideration other aspects such as the applicant’s reputation, previous
services, the structure of repair facilities etc. This problem is analysed by means of a multicriteria decision
model. The model seeks to be adapted to the decision maker’s inability to fix constant values for criteria
“weights” that must translate not only the importance of criteria, but also compensation rates between criteria in
additive value functions (Vincke (1992)). This flexibility in modeling is obtained by using the Variable
Interdependent Parameters approach, presented by Dias & Clímaco (2000) and discussed as follows.
3. The Variable Interdependent Parameters method
The Variable Interdependent Parameters (VIP) is a compensatory method where the coefficients of an additive
value function (scale constants or “weights”) are treated as interdependent parameters subject to constraints
imposed by the decision maker’s preference structure (Dias & Clímaco (2000)). Unlike MAUT’s additive utility function (Keeney & Raiffa (1976)), the logic of the VIP method considers the assessment of an alternative as being not only a function of its performance on the criteria but also a function of the coefficients of the criteria.
This evaluation is made by means of an additive value function which takes the coefficients of the criteria
as variables, by assessing all possible combinations of these parameters within a vectorial space, a “weights
space” allowed by the decision maker.
The VIP method explores the combinations of all parameters for which the decision maker has expressed indifference through a set of constraints. The VIP additive value function is given in equation (1):

V(a_i, \mathbf{k}) = \sum_{j=1}^{n} k_j u_j(g_{ij}),   (1)

where a_i is any alternative of the decision problem, k = (k_1, k_2, …, k_n) is a point in the decision set K (the space
of coefficients informed by the decision maker) and uj(gij) is the utility of the performance of alternative i in
criterion j.
Imprecise information in this method is related to the criteria coefficients k = (k_1, k_2, …, k_n), assuming that
K is bounded by linear constraints. These constraints imposed by the decision maker may bind criteria
coefficients through upper and lower limits for coefficient values, through a ranking of these parameters or
through restrictions on criteria trade-offs. Then, to assess the global performance of an alternative, the VIP
method makes use of four approaches that complement each other in order to obtain rich and robust conclusions
supporting the resulting choice. These approaches are briefly described below (Dias & Clímaco (2000)):
• Approach based on optimality: Searches for the alternative presenting the best performance in the additive
value function for all k ∈ K . Although it is not easy to find such an alternative, this approach helps to
identify and then eliminate dominated alternatives, thus reducing the set of alternatives under analysis;
• Approach based on pairwise comparison: This explores the subset of K that favours each alternative when two alternatives are compared. It helps to identify, inside the set K, the greatest advantage of choosing an alternative ai in relation to another specified alternative aj.
• Approach based on variation ranges: Exploring all possible parameter combinations in K, this approach permits an observation of which alternatives are most affected by coefficient variations, which avoids subsequent sensitivity analysis.
• Approach based on pessimistic aggregation rules: This approach helps to identify, for each alternative ai, the
greatest difference within K between the global performance of ai and the global performance of all other
alternatives with a higher value in the additive functions. Due to imprecise information related to criteria
coefficients, an alternative ak with the lowest relative disadvantage regarding global performance may be
recommended under this approach.
4. The proposed decision model
Firstly, a preference and probabilistic modelling is performed for each criterion of the decision model. Applying
elicitation procedures, such as described in Keeney & Raiffa (1976), a utility function for each criterion is
obtained from the decision-maker. The parameters and shape of the three utility functions result from the
elicitation procedures (Almeida (2005)). Using these functions, the decision-maker’s preference modeling is
performed and his behavior (averse, neutral or prone) regarding risk is incorporated (Keeney & Raiffa (1976)).
For interruption time and contract cost criteria, the exponential utility function has been found in Almeida
(2001) for the utility functions U(t_i) and U(c_i), and they are given as follows:

U(t_i) = e^{-A_1 t_i},   (2)

U(c_i) = e^{-A_2 c_i}.   (3)
Often found in practice (Keeney & Raiffa (1976)), exponential utility functions mean that higher values of
consequences are much more undesirable for the decision maker than lower ones (Almeida (2002)). In this
problem of maintenance contract selection, it is assumed that each alternative ai has its particular contract cost,
which is a constant value. Hence, the cost criterion is evaluated directly for each alternative through expression
(3). However, for the interruption time criterion, the evaluation of each alternative ai must be based on the
probabilistic feature of TI. This implies that a probability density function fi(ti ) must be taken into account.
In this paper, it is assumed that TTR (and consequently TI, for it has been assumed that TI =TTR) follows a
Gamma distribution function, with shape parameter n=2 for any alternative. The other parameter, ui, may be
different for each contract option ai. This assumption is reasonable in practical situations where interruption
time is concentrated around a modal value. Thus, the p.d.f. of T_I for each alternative is given as follows:

f(t_i) = u_i^2\, t_i\, e^{-u_i t_i}.   (4)
Therefore, ti is not directly evaluated for each alternative ai, but the evaluation of contract alternatives on
this criterion is based on parameter ui through the utility function U(ui). This is derived from U(ti) applying the
linearity property of utility theory (Berger (1985)), as presented in expression (5) (Almeida (2005));
U(u_i) = \int_0^\infty U(t_i)\,\Pr(t_i \mid u_i)\,dt_i.   (5)
In expression (5), Pr(ti| ui) is represented by fi(ti ). So, replacing (2) and (4) into (5) and solving the resulting
integration it follows that
U(u_i) = u_i^2/(A_1 + u_i)^2.   (6)
Thus, although the DM expresses his preferences on t_i by U(t_i), each alternative’s performance on this criterion is analysed through U(u_i) by means of expression (6).
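The integration from (5) to (6) is routine; a quick symbolic check (our own, using sympy) confirms it:

import sympy as sp

t, u, A1 = sp.symbols('t u A_1', positive=True)
U_t = sp.exp(-A1 * t)                     # exponential utility, equation (2)
f_t = u**2 * t * sp.exp(-u * t)           # Gamma pdf with shape 2, equation (4)
U_u = sp.integrate(U_t * f_t, (t, 0, sp.oo))
print(sp.simplify(U_u))                   # u**2/(A_1 + u)**2, i.e. equation (6)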
Regarding the dependability criterion, it is assumed in this paper that there is an uncertainty related to the
real value of ui and this uncertainty can be expressed through a prior probability density function on ui (Almeida
(2005)). This probability, represented by π(u_i), can be obtained by means of Bayesian elicitation procedures
(Keeney & Raiffa (1976)). Then, for each alternative ai, di is expressed as the probability that u i ≥ u ic , where
uic is the parameter value specified for ui in the contract proposal. For this kind of criterion, as pointed out by
Almeida (2005), a logarithmic utility function is often found in practice (Keeney & Raiffa (1976)). So, the utility of a dependability value d_i is given by expression (7) below:

U(d_i) = B_3 + C_3 \ln(A_3 d_i).   (7)
Once U(ui), U(ci) and U(di) are obtained for each contract proposal ai, an inter-criteria assessment is
performed to provide a suggestion for the best alternative according to the decision maker’s preferences. It is
assumed in this paper that the decision maker presents a compensatory rationality represented by an additive
value function. However, when such a type of function is adequate to get a global performance for each contract
alternative, it can be seen that the decision maker may find it hard to set fixed values for criteria coefficients,
and to state the relative importance of criteria. He may not feel comfortable about saying how much he is willing to lose in one criterion in order to obtain a gain in another. This situation is framed in a context of partial
information related to the scale of criteria coefficients used in an additive model. For tackling the selection of
maintenance contracts in this context, the Variable Interdependent Parameters Method is used here to support
the decision process with regard to the choice of the contract proposal most preferred. So, the additive value
function used to obtain a global assessment for each contract alternative is presented in expression (8):

V(a_i, k_1, k_2, k_3) = k_1 U_{u_i}(u_i) + k_2 U_{c_i}(c_i) + k_3 U_{d_i}(d_i).   (8)
With this expression, the global performance of each contract proposal is not only a function of its performance
on interruption time, cost and dependability criteria, but it is also a function of the coefficients of the criteria.
These are not fixed, but they are limited in a three-dimension space defined by preference constraints which are
defined by the decision maker as an input to the model.
5. A numerical application
A numerical application is presented to illustrate the use of the proposed decision model. A DM has to select a
maintenance contract where 6 contract alternatives, given in Table 1, are available. Values on criteria and the
utilities of performance are given for all alternatives. Utility values have been assessed after applying an
elicitation procedure as described in Keeney & Raiffa (1976). From this elicitation procedure, the parameters
obtained for the utility functions were: A1=0.09; A2=0.004; A3=55; B3=-4.2; C3=1.3. Once utility values for all
alternatives in all criteria are obtained, a multicriteria analysis through a variable interdependent parameters
approach is performed. In this application, the software VIP Analysis (Dias & Clímaco (2000)), a decision
support tool that implements the VIP method, is used to undertake the calculations and to display the results.
Alternative    u_i     c_i    d_i     U(u_i)   U(c_i)   U(d_i)
a1             0.90    175    0.95    0.83     0.50     0.94
a2             0.75    100    0.80    0.80     0.67     0.72
a3             0.75    120    0.85    0.80     0.62     0.80
a4             0.50     75    0.70    0.72     0.74     0.55
a5             0.70     75    0.65    0.79     0.74     0.45
a6             0.30     60    0.70    0.59     0.79     0.55

Table 1. Performances of maintenance contract alternatives
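As a check, the utility columns of Table 1 follow directly from expressions (3), (6) and (7) with the elicited parameters A1 = 0.09, A2 = 0.004, A3 = 55, B3 = -4.2, C3 = 1.3. The short script below (ours, for verification only) reproduces them:

import math

A1, A2, A3, B3, C3 = 0.09, 0.004, 55.0, -4.2, 1.3   # elicited parameters from Sect. 5

def U_u(u): return u**2 / (A1 + u)**2              # equation (6)
def U_c(c): return math.exp(-A2 * c)               # equation (3)
def U_d(d): return B3 + C3 * math.log(A3 * d)      # equation (7)

alternatives = {  # (u_i, c_i, d_i) from Table 1
    'a1': (0.90, 175, 0.95), 'a2': (0.75, 100, 0.80), 'a3': (0.75, 120, 0.85),
    'a4': (0.50,  75, 0.70), 'a5': (0.70,  75, 0.65), 'a6': (0.30,  60, 0.70),
}
for name, (u, c, d) in alternatives.items():
    print(name, round(U_u(u), 2), round(U_c(c), 2), round(U_d(d), 2))
# Reproduces the utility columns of Table 1, e.g. a2 -> 0.80, 0.67, 0.72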
If the decision maker felt secure about setting fixed values for the three criteria coefficients of the additive
model, the global performances of alternatives could be ranked in order to obtain the one with the highest
performance. However, in this problem, the decision maker has only felt able to record upper and lower limits
for each criteria coefficient, and to rank these criteria, in order of importance, as listed as follows: cost,
interruption time and dependability. This ranking can be expressed by k2 ≥ k1 ≥ k3. This information is shown in
Table 2, and with the addition of a normalization constraint, it represents a set of constraints which places limits
on a coefficients space in which robust conclusions will be sought.
Criterion            Coefficient   Importance Order   Lower Bound   Upper Bound
Interruption Time    k1            2nd                0.25          0.60
Contract Cost        k2            1st                0.40          0.80
Dependability        k3            3rd                0.10          0.50

Table 2. Upper and lower bounds for criteria coefficients
The information presented in Table 1 and Table 2 is then inserted into VIP Analysis software for a
multicriteria assessment using the Variable Interdependent Parameters Method. Even without fixing exact
values for criteria coefficients, some conclusions about the range of the global performance of alternatives can
be drawn from Figure 1 below.
Figure 1. Minimum and maximum values for alternatives with an imprecise additive value function (left side); and
performance ranges (right side)
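Since V(a_i, k) is linear in k, the minimum and maximum global values behind a display like Figure 1 can be computed by linear programming over the coefficient space K. Below is a minimal sketch (our own, independent of the VIP Analysis software), using the constraints of Table 2 plus the normalisation k1 + k2 + k3 = 1:

import numpy as np
from scipy.optimize import linprog

# Utilities from Table 1: rows a1..a6, columns (U(u_i), U(c_i), U(d_i))
U = np.array([
    [0.83, 0.50, 0.94],
    [0.80, 0.67, 0.72],
    [0.80, 0.62, 0.80],
    [0.72, 0.74, 0.55],
    [0.79, 0.74, 0.45],
    [0.59, 0.79, 0.55],
])

# Feasible set K: bounds from Table 2, ranking k2 >= k1 >= k3, sum to 1.
bounds = [(0.25, 0.60), (0.40, 0.80), (0.10, 0.50)]
A_ub = np.array([[1.0, -1.0, 0.0],    # k1 - k2 <= 0
                 [-1.0, 0.0, 1.0]])   # k3 - k1 <= 0
b_ub = np.zeros(2)
A_eq, b_eq = np.ones((1, 3)), np.array([1.0])

for i, u in enumerate(U, start=1):
    lo = linprog(u, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    hi = linprog(-u, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    print(f"a{i}: min V = {lo.fun:.3f}, max V = {-hi.fun:.3f}")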
From Figure 1, it can be seen that there are no dominated alternatives. Alternative a2, the results of which
are highlighted, presents the highest minimum value of global performance within the whole coefficients space
bounded by the decision maker. On the right side of Figure 1, alternative a2 is also displayed as presenting the
narrowest performance range under the coefficient variation, which indicates a2 as the alternative with the most
robust performance from all contract alternatives.
In Figure 2, two results are shown from a pairwise confrontation and a maximum regret analysis with VIP
Analysis software. In a maximum regret analysis, the set of coefficients is observed in which each alternative is
dominated by the others, and the highest performance disadvantage between this alternative and all the others
stands out.
Figure 2. Pairwise confrontation table (left side); and maximum regret related to alternatives (right side)
From this analysis, it is also recommended that alternative a2 be selected, since it presents the lowest maximum regret. Therefore, within a context of incomplete information related to the criteria coefficients of a maintenance contract selection model, alternative a2 has been indicated for selection because it has the highest minimum global value, the lowest performance variability and the lowest maximum regret among the six alternatives for the maintenance contract.
6. Conclusions
This paper has presented a multicriteria decision model for dealing with the selection of maintenance contracts
within a context where the decision maker, using a compensatory rationality modeled by an additive value
function, is not able to give or does not feel comfortable about giving precise information about criteria
importance or criteria trade-offs. Interruption time, contract cost and dependability criteria have been used to
assess contract proposals. Utility Theory has been applied for assessing the utility of the performance of
alternatives in each criterion, which allowed the incorporation of the decision maker’s preferences and behavior
regarding risk.
The proposed model was applied using the Variable Interdependent Parameters Method and the VIP Analysis software. Using the coefficient constraints given by the decision maker, the model allowed robust conclusions to be drawn about the performances of contract alternatives, and the best one to be identified with regard to its performance variability, maximum regret and relative advantages and disadvantages within all possible combinations of
criteria coefficients. Future studies may usefully explore additions of new criteria and the use of new
probabilistic assumptions to model the interruption time criterion. They may also explore possible model
adjustments in order to allow their application in a context of group decision making.
References
Almeida, A. T. (2001) Multicriteria decision making on maintenance: Spares and contracts planning, European J. of Operational
Research, 129 (2), 235-241
Almeida, A. T. (2002) Multicriteria modelling for repair contract problem based on utility function and ELECTRE I method, IMA J. of
Management Mathematics, 13, 29-37
Almeida, A. T. (2005) Multicriteria modelling of repair contract based on utility and ELECTRE I method with dependability and service
quality criteria, Annals of Operations Research 138, 113-126
Berger, J. O. (1985) Statistical Decision Theory and Bayesian Analysis, Berlin: Springer-Verlag
De Boer, L., Van Der Wegen, L. and Telgen, J. (1998) Outranking methods in support of supplier selection, European J. of Purchasing
& Supply Management, 4, 109-118
Dias, L.C. and Clímaco, J. N. (2000) Additive aggregation with variable interdependent parameters; the VIP analysis software, J. of the
Operational Research Society, 51 (9), 1070-1082
Ghodsypour, S. H. and O’Brien, C. (1998) A decision support system for supplier selection using an integrated analytic hierarchy process
and linear programming, International J. of Production Economics, 56-57, 199-212
Huang, S. H. and Keskar, H. (2007) Comprehensive and configurable metrics for supplier selection, International J. of Production
Economics, 105, 510-523
Kennedy, W. J. (1993) Modelling in-house vs contract maintenance with fixed costs and learning effects, International J. of Production Economics, 32 (3), 277-283
Keeney, R. L. and Raiffa, H. (1976) Decisions with Multiple Objectives: Preferences and Value Trade-Offs, New York: John Wiley & Sons
Slack, N. and Lewis, M. (2002) Operations Strategy, London: Prentice Hall
Vincke, P. (1992) Multicriteria Decision Aid, New York: John Wiley & Sons
Weber, C. A., Current, J. R. and Benton, W. C. (1991) Vendor selection criteria and methods, European J. of Operations Research, 50,
2-18
Spare parts planning and risk assessment associated with non-considering system
operating environment
Behzad Ghodrati
Div. of Operation and Maintenance Engineering, Luleå University of Technology, Luleå, 97753, Sweden
[email protected]
Abstract: Spare parts needs – as an issue in the field of product support – are dependent on the technical
characteristics of the product, e.g. its reliability and maintainability, and the operating environment in which the
product is going to be used (e.g. the temperature, humidity, and the user/operator’s skills and capabilities),
which constitute covariates. The covariates have a significant influence on the system reliability characteristics
and consequently on the system failure and number of required spare parts. Ignoring this factor might cause
irretrievable losses in terms of production and ultimately in terms of economy. This was proved by the event
tree risk analysis method used in a new and non-standard form in the present paper. It has been found that the
percentage of risk associated with not considering the system operating environment in spare parts estimation is
relatively high.
1. Introduction – Product support
Most industrial products and systems wear and deteriorate with use. In general, due to economic and technological considerations, it is almost impossible to design a machine/system that is maintenance-free. In fact, maintenance requirements arise mainly from the lack of properly designed reliability and task performance quality. Therefore, the role of maintenance and product support can be perceived as the
process that compensates for deficiencies in design, with regard to the reliability of the product and the quality
of the output generated by the product (Markeset & Kumar (2003)). The product support and maintenance
needs of systems to a large extent are decided during the design and manufacturing phase (e.g. Blanchard (2001),
Blanchard & Fabrycky (1998), Goffin (2000), Markeset & Kumar (2001) and Smith & Knezevic (1996)).
The product support and service delivery performance in the operational phase can be enhanced through
better provision of spare parts and improvement of the technical support system. However, to ensure the desired
product performance at the lowest cost, we have to design and develop maintenance and product support
concepts right from the design phase. The existing literature appears to have paid little attention to the influence
of product design characteristics, influenced by the product operating environment, in the dimensioning of
product support, especially in the field of spare parts planning.
Therefore, spare parts needs (as an issue of product support) are dependent on the engineering
characteristics of the product (reliability and maintainability), the human factors (operators’ skills and
capabilities) and the environment in which the product is working. Therefore, product support specifications
should be based on the design specifications and the conditions faced by the customer. The risk associated with
the ignoring of the system operating environmental factors is remarkable and plays an important role in the cost
of operation (Ghodrati & Kumar (2005a)).
2. Operating environment
The operating environment should be seriously considered when dimensioning product support and service
delivery performance strategies. Generally, the recommended maintenance program for systems and
components is based on their age and condition without any consideration of the operating environment, which
leads to many unexpected failures. This creates poor system performance and a higher Life Cycle Cost (LCC)
due to unplanned repairs and/or restoration, as well as support. The environmental conditions in which the
equipment is working, such as the temperature, humidity, dust, the maintenance crew’s and operator’s skill, the
operation profile, etc., often have a considerable influence on the product failure behavior and thereby on the
maintenance and product support requirement (Ghodrati (2005), Kumar et al. (1992)). Furthermore, the
“Distance” in many ways (not only in geographical terms, but also in terms of infrastructure, culture, etc.) of the
user from the manufacturer/supplier can exert an additional influence on spare parts management.
3. Spare parts
Industrial systems and technical installations may fail, and repair is therefore needed to retain them in, or restore them to, working condition. These systems and installations are also subject to planned maintenance. In most cases, maintenance and repair require pieces of equipment to replace defective parts. The common name for these parts is spare parts. They may be subdivided into:
1. Repairable parts: if a repairable item has failed and a shutdown of the system is unavoidable, the user has to accept at least the time required to repair the item before the system is up again.
2. Non-repairable parts, also called consumables, which are the parts considered and studied in the present research. In other words, we limit ourselves to non-repairable spare parts in the normal phase. If such a part fails, it is removed and replaced by a new item.
Clearly, the control and management of spare parts constitute a complex matter. Common statistical models for inventory control lose their applicability, because the demand process differs from that assumed, owing to the machine characteristics, the operating situation and unpredictable events during operation. An essential element in many models is forecasting demand, which requires historical demand figures; these are unavailable or invalid for new parts and/or parts with low consumption. Moreover, the shorter life cycles of products and better product quality further reduce the possibility of collecting historical data. Unfortunately, the pragmatic approaches to spare parts inventory management and control are not validated in any way, and so controllability and objectivity are hard to guarantee (Fortuin & Martin (1999)). The product reliability characteristics and operating environment based spare parts forecasting method (Ghodrati & Kumar (2005a,b)), as a systematic approach, may improve this undesirable situation. The key questions in any logistics management are the following:
• Which items are needed as spare parts?
• Which items do we put in stock?
• When do we (re)order? and
• How many items do we (re)order?
Therefore, the main objective of this paper is to estimate the required number of spare parts and evaluate the
associated risks (risk of shortage of spare parts due to not considering the operating environment leading to
financial losses).
4. The context of spare parts logistics
As mentioned earlier, spare parts are required in the maintenance process of systems. From the point of view of a spare parts supplier or systems manufacturer, we can make a distinction between two types of
industrial products that require spare parts:
1. Conventional products: These are the products and systems sold to customers and installed at the
customer's site for the purpose of providing products or services. These systems are under the users’ control, and
are exemplified by machines in production departments, transport vehicles, TVs, computers and private cars.
Mostly there is a Technical Service Department within the client location/organization performing maintenance
and controlling an inventory of spare parts.
In some cases a Technical Service Department of the firm that sold the system, i.e. the original equipment
manufacturer (OEM), carries out the maintenance of these systems under separate contracts and conditions.
2. Functional products: In the functional products category, the user does not buy a machine/system but the
function that it delivers (Markeset & Kumar (2003a)). To avoid the complexities of maintenance management,
many customers/users prefer to purchase only the required function and not the machines or systems providing
it. In this case the responsibility for the maintenance and product support lies with the organization delivering
the required function.
The diversity of the characteristics of spare parts management situations that have to be taken into account
is usually quite large. Therefore, we need to categorize the entire assortment of spare parts. This can be
conveniently accomplished according to the characteristics of an individual aspect, after which specialized
control methods can be developed for each category (e.g. defining the criticality and the risk of shortage).
The following are examples of criteria that can be used for categorization:
• Demand intensity
• Purchasing lead-time
• Delivery time
• Planning horizon
• Essentiality, vitality, criticality of a part
• Price of a spare part
• Costs of stock keeping
• (Re-)ordering costs
In fact, in any given situation, not all the criteria are necessarily relevant and usable.
Therefore, as mentioned previously, it is wise to classify the spare parts into groups, to establish
appropriate levels of control over each category. Based on the source of supply and cost, the spare parts can be
classified (in our case as well) into three groups of items, A, B and C, as follows:
A: Parts which can be procured overseas only and whose unit cost is very high (such as hydraulic pumps).
B: Parts which can be procured overseas only and whose unit cost is not high (e.g. seals).
C: Parts that are available locally (e.g. brake pads).
4.1. Estimation of the required number of spare parts
The environmental conditions in which the equipment is to be operated (e.g. temperature, humidity, dust, etc.)
often have a considerable influence on the product reliability characteristics (Kumar & Kumar (1992), Blischke
& Murthy (2000) and Kumar et al. (1992)). In fact, the reliability characteristics (e.g. the failure (hazard) rate)
of a system are a function of the time of operation and the environment in which the system is operating.
The failure (hazard) rate of a system/component is the product of the baseline hazard rate λ0(t), dependent
on time only, and one other positive functional term (that is basically independent of time) which incorporates
the effects of operating environment factors (covariates), e.g. the temperature, pressure and operator’s skill. The
baseline hazard rate is assumed to be identical and equal to the hazard rate when the covariates have no
influence on the failure pattern. Therefore, the actual hazard rate (failure rate) in the Proportional Hazard Model
(PHM) (Cox (1972)) with respect to the exponential form of the time-independent function, which incorporates
the effects of covariates, can be defined as follows:
\lambda(t, z) = \lambda_0(t) \exp(z\alpha) = \lambda_0(t) \exp\Big(\sum_{j=1}^{n} \alpha_j z_j\Big) \qquad (1)

where z_j, j = 1, 2, …, n, are the covariates associated with the system and \alpha_j, j = 1, 2, …, n, are the unknown parameters of the model, defining the effect of each of the n covariates.
It has been found (Ghodrati & Kumar (2005b)) that, in the case of a Weibull distributed lifetime of the components/system, the influencing covariates change the scale parameter, \eta, only, while the shape parameter, \beta, remains almost unchanged. Therefore, the scale and shape parameters after the influence of the covariates are

\beta = \beta_0, \qquad \eta = \eta_0 \Big[\exp\Big(\sum_{j=1}^{n} \alpha_j z_j\Big)\Big]^{-1/\beta} \qquad (2)
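As a minimal numerical sketch of equations (1) and (2) (in Python, with invented baseline parameters, covariates and regression coefficients that are not taken from the paper), the covariate adjustment can be computed as follows:

```python
import math

def adjusted_weibull_params(beta0, eta0, alphas, zs):
    """Eq. (2): covariates leave the shape parameter unchanged and
    rescale eta by [exp(sum(alpha_j * z_j))]^(-1/beta)."""
    s = sum(a * z for a, z in zip(alphas, zs))
    return beta0, eta0 * math.exp(s) ** (-1.0 / beta0)

def weibull_hazard(t, beta, eta):
    """Weibull hazard rate lambda(t) = (beta/eta) * (t/eta)^(beta - 1)."""
    return (beta / eta) * (t / eta) ** (beta - 1)

# Hypothetical baseline (shape 1.8, scale 1200 h) and two covariates
# (say, dust level and operator skill) with illustrative coefficients.
beta0, eta0 = 1.8, 1200.0
alphas, zs = [0.4, -0.2], [1.0, 0.5]

beta, eta = adjusted_weibull_params(beta0, eta0, alphas, zs)
# For a Weibull baseline this is equivalent to Eq. (1): the adjusted
# hazard equals the baseline hazard multiplied by exp(sum(alpha_j z_j)).
print(weibull_hazard(500.0, beta, eta) /
      weibull_hazard(500.0, beta0, eta0))   # = exp(0.3), approx 1.35
```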
The required number of spare parts can be calculated for an exponentially distributed time to failure (constant failure rate) as (Ghodrati (2005)):

1 - P(t) = \exp(-\lambda t) \times \sum_{x=0}^{N} \frac{(\lambda t)^x}{x!} \qquad (3)

where P(t) is the probability of a shortage of spare parts (1 - P(t) is the confidence level of spare part availability, or service level) and N is the total number of spare parts available in period t.
When the failure time follows a Weibull distribution, the number of required spare parts can be estimated as (Ghodrati (2005)):

N_t = \frac{t}{\mathrm{MTTF}} + \frac{\xi^2 - 1}{2} + \xi \sqrt{\frac{t}{\mathrm{MTTF}}}\, \Phi^{-1}(p) \qquad (4)

where \xi is the coefficient of variation of the time to failure (\xi = \sigma(T)/\mathrm{MTTF}) and \Phi^{-1}(p) is the inverse normal distribution function.
4.2. Spare parts inventory management
The logistics of spare parts is very important and difficult, since the demand is hard to predict, the consequences
of a stock-out may be disastrous, and the prices of parts are high. If the parts are understocked, then the
defective systems/machines cannot be serviced, resulting in lost production and consequently customer
dissatisfaction. On the other hand, if the parts are overstocked, the holding costs are high. Situations where some
parts have a very high inventory level and some are in shortage could be quite common. In such a service
system, an efficient inventory management system is essential.
The requirements for planning the logistics of spare parts differ from those of other materials in several
ways: the service requirements are higher as the effects of stock-outs may be financially remarkable, the
demand for parts may be extremely sporadic and difficult to forecast, and the prices of individual parts may be
very high. These characteristics set pressures for streamlining the logistic system of spare parts, and with high
requirements for material flow, it is natural that spare parts management should be an important area of
inventory research in the phases of design of technological systems and product support systems. The principal
objective of any inventory management system is to achieve an adequate service level with minimum inventory
investment and administrative costs. The optimum spare parts management strategy must describe what level of
service is to be offered and whether the customers are segmented and prioritized in terms of service, and it must
ensure the availability of parts and the quality of service at reasonable costs as a main concern in maintenance.
In general terms, when designing a spare parts logistics system, the following factors at least usually have
to be considered: the product-specific characteristics (e.g. the reliability characteristics), the location of
customers and their special requirements, and the system/machine operating environment. There are some
operational characteristics of maintenance spare parts that can be used for estimation of the spare parts need and
control of the inventory. The most relevant control characteristics are: criticality, demand and value (Huiskonen
(2001)). The criticality of an item is probably the first aspect that is defined by the spare part logistics
practitioners. The criticality of a part is related to the consequences of the failure of a part for the process in
question in the event of a replacement not being readily available. The impact of a shortage of a critical part
may be a multiple of its commercial value. One practical approach is to relate the criticality to the time in which
the failure has to be corrected. With respect to criticality, parts are either highly critical, medium-critical or noncritical (Huiskonen (2001)). High criticality means operationally that the need for the parts in the event of
failure is immediate, and parts of medium criticality allow some leadtime to correct the failure.
From the logistics control point of view, it is most essential to know how much time there is to react to the
demand need, i.e. whether the need is immediate or whether there is some time to operate. The predictability of
demand is related to the failure mode and process of a part, the intensity of operation and the possibilities of
estimating the failure pattern and rates by statistical means. From a control point of view, it is useful to divide
the parts in terms of predictability into at least two categories: parts with random failures (e.g. electronic parts)
and parts with a predictable wearing pattern (e.g. mechanical parts), and the present research deals with the
second category. The value of a part is a common control characteristic.
5. Risk analysis
5.1. Performance measurement
Since investments in spare parts can be substantial, management is interested in decreasing stock levels whilst
maximizing the service performance of a spare part management system. To assess the result of improvement
actions, performance indicators (such as the fill rate and service rate) are needed. For example, sometimes the
duration of the unavailability of parts is a major factor of concern, and then the waiting time for parts is a more
relevant performance indicator.
Performance measurement for risk represents a problem in its own right. Usually risk items are not issued,
but their presence in stock is justified. In this control category, the most important factor in performance
measurement is the risk of unavailability. In general, this risk can be expressed as (Fortuin & Martin (1999)):

\mathrm{RISK}_i = \Pr(D_i > S_i) \times C_i \qquad (5)

where RISK_i = the expected financial loss due to risk item i being out of stock,
D_i = the demand for item i during its entire (or remaining) life cycle,
S_i = the initial number of items of type i in stock, and
C_i = the financial consequences if an out-of-stock situation occurs for item i.
In the following we will discuss in greater detail the concept of risk analysis and the risk of unavailability of
spare parts when required.
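As an illustration only, the following Python sketch evaluates equation (5) for a single item, assuming, purely for the example, that the life-cycle demand D_i follows a Poisson distribution; the paper itself does not fix a demand distribution, and all numbers are invented.

```python
import math

def risk_of_shortage(mean_demand, stock, consequence_cost):
    """Eq. (5): RISK_i = P(D_i > S_i) * C_i, with D_i ~ Poisson(mean_demand),
    an assumption made here only for illustration."""
    cdf = sum(math.exp(-mean_demand) * mean_demand ** x / math.factorial(x)
              for x in range(stock + 1))     # P(D <= S)
    return (1.0 - cdf) * consequence_cost

# Hypothetical item: expected remaining life-cycle demand of 4 units,
# 6 units initially in stock, 50 000 monetary units lost per stock-out.
print(round(risk_of_shortage(mean_demand=4.0, stock=6, consequence_cost=50_000)))
```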
5.2. Risk definition
Kaplan & Garrick (1981) have discussed a number of alternative definitions of risk, including the following:
• Risk is a combination of uncertainty and damage.
• Risk is the ratio of hazards to safeguards.
• Risk is a triplet combination of an event, its probability and its consequences.
The term quantitative risk analysis refers to the process of estimating the risk of an activity based on the
probability of events whose occurrence can lead to undesired consequences. The term hazard expresses the
potential for producing an undesired consequence without regard to how likely such a consequence is. Therefore,
one of the hazards of the spare parts inventory is the shortage of a spare part when it is required, which could
produce a number of different undesired consequences. The term risk usually expresses not only the potential
for an undesired consequence, but also how probable it is that such a consequence will occur. Quantitative risk
analysis attempts to estimate the frequency of accidents and the magnitude of their consequences by different
methods, such as the fault tree and the event tree methods.
In fact, maintenance plays a pivotal role in managing risks at an industrial site, and it is important that the
right risk assessment tools should be applied to capture and evaluate the hazards at hand to allow a functional
risk-based approach (Rasche et al. (2000)). Unplanned stoppages or unnecessary downtime will always result in
a temporary upset to the operations flow and output. The cumulative unavailability of the machine (in the case
of a spare parts shortage) and the beneficiation process and the added cost incurred can quickly affect the
financial performance of a system.
6. Risk management
Risk management is an iterative process, as shown in Figure 1 in the appendix. Successful risk management
depends on a clearly defined scope for the risk assessment, comprehensive and detailed hazard mapping and a
thorough understanding of the possible consequences. There are several tools and techniques available to the
managers and engineers that can help to estimate the level of risk better. These may be either ‘subjective –
qualitative’ or ‘objective – quantitative’, as shown in Figure 2 in the appendix. Both categories of techniques
have been used effectively in establishing risk-based safety and maintenance strategies in many industries
(Rasche et al. (2000)).
Quantitative methods are probably ideal for maintenance applications where some data is available and
decisions on system safety and criticality are to be made. Even very basic reliability analysis of maintenance
data can be used effectively in determining the optimum maintenance intervention, replacement intervals or
monitoring strategy. Fault Tree Analysis and Event Tree Analysis (FTA/ETA), which are considered semi-quantitative methods, are tried and tested system safety tools originating from the defense, nuclear and aviation industries.
While ETA traces the development of an event and yields quantified risk estimates of all event paths, FTA is
concerned with the identification and analysis of conditions and factors which cause or contribute to the
occurrence of a defined undesirable event, usually one which significantly affects system performance,
economy, safety or other required characteristics. FTA is often applied to the safety analysis of systems (IEC
1025, 1990). In the following these methods will be presented in greater detail.
6.1. Risk analysis process
As briefly mentioned earlier, the risk analysis can be accomplished through the following steps:
1. Define the potential event sequences and potential incidents.
2. Evaluate the incident outcomes (consequences).
3. Estimate the potential incident frequencies. Fault tree or generic databases may be used for initial event
sequences. Event trees may be used to account for mitigation and post-release events.
4. Estimate the incident impacts on the health and safety, the environment and property (e.g. economy).
5. Estimate the risk. This is achieved by combining the potential consequence for each event with the
event frequency, and determining the total risk by summing over all consequences.
7.1. Fault tree analysis
Fault Tree Analysis (FTA) is classified as a deductive method which determines how a given system state can
occur. FTA is a technique that can be used to predict the expected probability of the failure/hazardous outcome
of a system in the absence of actual experience of failure (Rasmussen (1981)). This lack of experience may be
due to the fact that there is very little operating experience, or the fact that the system failure/hazard rate is so
low that no failures have been observed. The technique is applicable when the system is made up of many parts
and the failure/hazard rate of the parts is known.
The fault tree analysis always starts with the definition of the undesired event whose possible causes,
probability and conditions of occurrence are to be determined. The probability of failure can be a probability of
failure on demand (such as the probability that a car will fail to start when the starter switch is turned). In our
case the event will be “system downtime” and is shown in the top box as a top event. The fault tree technique
has been evolving for the past four decades and is probably the most widely used method for the quantitative
prediction of system failure. However, it is becoming exceedingly difficult to apply in very complicated
problems.
7.2. Event tree analysis
An event tree is a graphical logic model that identifies and quantifies possible outcomes following an initiating
event. The event tree provides systematic coverage of the time sequence of the event propagation.
The event tree structure is the same as that used in decision tree analysis (Brown et al. (1974)). Each event
following the initiating event is conditional on the occurrence of its precursor event. The outcomes of each
precursor event are most often binary (success or failure, yes or no), but can also include multiple outcomes (e.g.
100%, 40% or 0%).
Event trees have found widespread applications in risk analysis. Two distinct applications can be identified.
The pre-incident application examines the systems in place that would prevent incidents that can develop into
accidents. The event tree analysis of such a system is often sufficient for the purposes of estimating the safety of
the system. The post-incident application is used to identify incident outcomes. Event tree analysis can be
adequate for this application.
Pre-incident event trees can be used to evaluate the effectiveness of a multi-element proactive system. A
post-incident event tree can be used to identify and evaluate quantitatively the various incident outcomes that
might arise from a single initiating (hazardous) event.
Fault trees are often used to describe causes of an event in an event tree. Moreover, the top event of a fault
tree may be the initiating event of an event tree. Note the difference in the meaning of the term initiating event
between the applications of fault tree and event tree analysis. A fault tree may have many basic events that lead
to the single top event, but an event tree will have only one initiating event that leads to many possible outcomes.
The sequence is shown in the logic diagram in the appendix (Figure 3).
8. Case study
As mentioned earlier, operation stoppages in the case of system/machine downtime are mostly due to the
lack/unavailability of required spare parts. Wrong estimation of the required number of spares in the specific
time horizon is one of the reasons for these events. The system/machine operating environment is an important
factor which affects the function of machines. This factor also influences the maintenance and support plan of a
system, and ignoring this factor is one of the most significant reasons for inaccurate forecasting of the required
number of spare parts.
In this paper we have attempted to analyze the risk of ignoring the effects of operating environment factors on the output of a process, in the form of system/machine downtime and loss of production. For this risk analysis we carried out mainly event tree analysis, but also applied fault tree analysis as a complementary method. Both event tree and fault tree analysis have been used in a special, non-standard way in which the organizational states and decisions, as well as the events and their consequent changes, are introduced and taken into account in the analysis. The case studied concerns the hydraulic pump of the brake system of the fleet of loaders in the Choghart Iron Ore Mine in Iran.
8.1. Construction of the event tree
The construction of an event tree is sequential, and like fault tree analysis, is performed from the left to the right
(in the usual event tree convention). The construction begins with the initiating event, and the temporal
sequences of occurrence of all the relevant safety functions or events are entered. Each branch of the event tree
represents a separate outcome (event sequence – as shown in Figure 4 in the appendix).
The initiating event (Step 1) is usually a failure/undesired event corresponding to a release of failure/hazard.
The initiating event in our case is “ignoring the product operating environment”, and the frequency of this
incident was estimated from the historical records.
The safety function and organizational states (Step 2) are actions or barriers that can interrupt the sequence
from an initiating event to a failure/hazardous outcome (in other words, safety functions/organizational states
and decisions are different state descriptions and are components of a chain of explanations). Safety functions
can be of different types, most of which can be characterized as having outcomes of either success or failure
with regard to demand. In our case this step comprises:
a) Inadequate product support planning (organizational state)
b) Inadequate/poor spare parts estimation (organizational decision)
c) Shortage of spare parts when required (event)
d) Excessive system/machine downtime (consequent event)
e) Loss of production in the case of system downtime (consequent event)
f) Economic loss in the case of loss of production (consequent event)
As observed, and also mentioned earlier, this is not a standard form of event tree analysis. It is a special form in which the safety function may also be defined as an undesired situation (state), instead of a state similar to a barrier as in the standard form of event tree analysis. Each heading in the event tree corresponds to a state/event/condition (Step 5) of some outcome taking place if the preceding event has occurred. Therefore, the probability associated with each branch is conditional and differs from one state to another (e.g. based on long/short term decisions and the criticality of the spare parts). The sources of conditional probability data in our case are historical records (e.g. daily reports from the operators, the maintenance crew at the workshop and the inventory system), interviews with the management of the maintenance and spare parts inventory departments, and experience; these probabilities are shown on the branches in Figure 4 in the appendix.
The frequency of each outcome is determined by multiplying the initiating event frequency with the
conditional probabilities along each path leading to that outcome. The qualitative output shows the number of
outcomes that result in the success versus the number of outcomes resulting in the failure of the protective
system in a pre-incident application. The qualitative outcome from a post-incident analysis is the number of
more hazardous outcomes versus the number of less hazardous ones. The quantitative output is the frequency of
each event outcome.
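The arithmetic behind these outcome frequencies is a simple product along each path, as the following Python sketch illustrates; the initiating frequency and branch probabilities below are invented for illustration and are not the values of Figure 4.

```python
from math import prod

def path_frequency(initiating_frequency, branch_probabilities):
    """Outcome frequency = initiating event frequency times the conditional
    probability of every branch along the path to that outcome."""
    return initiating_frequency * prod(branch_probabilities)

# One hypothetical path through headings a)-f): poor support planning (0.8),
# poor spares estimate (0.7), shortage (0.6), excessive downtime (0.9),
# production loss (0.8), economic loss (0.9).
print(path_frequency(200, [0.8, 0.7, 0.6, 0.9, 0.8, 0.9]))  # 43.5456
```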
The event tree shown in the appendix (Figure 4) was developed on the basis of the existing situations and the experience of the people involved (e.g. the maintenance and inventory management), who were aware of the related consequences of the events. There are 15 output branches, which cover most of the possible combinations of branches and cases. The upper branches represent the success (yes) connected with poor situations, such as the existence of poor product support planning and/or loss of production (output), and the lower branches represent the corresponding failure (no), indicating, for instance, a strong need for accurate spare parts estimation. The complete event tree analysis (Figure 4) includes the estimation of all the output branches' frequencies. As is seen from the estimated frequencies, the sequences listed below have a high probability of loss (classified into two consequence groups, CRASH and HARD) related to ignoring the product operating environment factors in the dimensioning of product support and system function.

CRASH:
ABCDEF = 71.9712
ABC DEF = 47.9808
ABCDEF = 55.9776
ABCDEF = 55.9776

HARD:
ABCDE F = 30.8448
ABCD EF = 17.9928
ABC D EF = 7.7112
ABCDEF = 5.1408
These high probability outputs mostly belong to the situation in which the operating environment has been
ignored. Therefore, it is important and recommended to take this factor into consideration when estimating and
managing the spare parts inventory.
8.2. Fault tree analysis
A simple fault tree analysis was also carried out in this research, as a method complementary to the event tree analysis, to ascertain the influence of not considering the system's working environment on the system downtime, which creates a need for repair, maintenance and consequently spare parts. As can be observed in the fault tree chart (Figure 5 in the appendix), the probability of system stoppage is influenced and controlled by the operating environment factors. For this fault tree (Figure 5), no exact probabilities corresponding to the gates have been calculated. In addition, because the figure uses only OR gates, it does not consider the system level.
Conclusions
It has been clearly shown that the operating environment has a significant influence on the planning of product
support and spare parts requirements, through product reliability characteristics (Ghodrati & Kumar (2005a,b)).
Therefore, product support specifications should be based on the design specifications and the conditions faced
by the customer. The remarkable influence of considering and/or ignoring the operating environment factors on
the forecasting and estimation of the required spare parts is validated by the result of risk analysis.
In the present paper we have performed a risk analysis of not considering the system working conditions in spare parts planning, through a new and non-standard event tree and fault tree analysis. We introduced and implemented an event tree analysis in which the states of the organization and managerial decisions take part in the risk analysis. In other words, as the safety functions in the event tree analysis, we used undesired states instead of barriers, in combination with events and their consequent changes. Based on the results from the event tree analysis, there is a considerable risk associated with ignoring these working environment factors, which might cause irretrievable losses.
References
Billinton, R. and Allan, R. N. (1983) Reliability Evaluation of Engineering Systems: Concepts and Techniques, Boston, Pitman Books
Limited
Blanchard, B. S. (2001) Maintenance and support: a critical element in the system life cycle, Proceedings of the International
Conference of Maintenance Societies, May, Melbourne, Paper 003
Blanchard, B. S. and Fabrycky, W. J. (1998) Systems Engineering and Analysis, 3rd ed., Upper Saddle River, NJ, Prentice-Hall
Brown, R. V., Kahr, A. S. and Peterson, C. (1974) Decision Analysis for the Manager, New York, Holt, Rinehart & Winston
Cox, D. R. (1972) Regression models and life-tables, J. of the Royal Statistical Society, B34, 187-220
Fortuin, L. and Martin, H. (1999) Control of service parts, International J. of Operations & Production Management, 19 (9), 950-971
Ghodrati, B. (2005) Reliability and operating environment based spare parts planning, PhD thesis, Luleå University of Technology,
Sweden, ISSN: 1402-1544
Ghodrati, B. and Kumar, U. (2005a) Operating environment based spare parts forecasting and logistics: a case study, International J. of
Logistics: Research and Applications, 8 (2), 95-105
Ghodrati, B. and Kumar, U. (2005b) Reliability and operating environment based spare parts estimation approach: a case study in Kiruna
Mine, Sweden, J. of Quality in Maintenance Engineering, 11 (2), 169-184
Gnedenko, B. V., Belyayev, Y. K. and Solovyev, A. D. (1969) Mathematical Methods of Reliability, New York, Academic Press
Goffin, K. (2000) Design for supportability: essential component of new product development, Research Technology Management, 43
(2), March/April, 40-47
Huiskonen, J. (2001) Maintenance spare parts logistics: special characteristics and strategic choices, International J. of Production
Economics, 71 (1-3), 125-133
Kaplan, S. and Garrick, B. J. (1981) On the quantitative definition of risk, Risk Analysis, 1, 11–27
Kumar, D. and Kumar, U. (1992) Proportional hazard model: a useful tool for the analysis of a mining system, Proceedings of the 2nd
APCOM Symposium, Tucson, Arizona, 6-9 April, pp. 717-24
Kumar, D., Klefsjö, B. and Kumar, U. (1992) Reliability analysis of power cables of electric loader using proportional hazard model,
Reliability Engineering and System Safety, 37, 217-22
Kumar, U. D., Crocker, J., Knezevic, J., El-Haram, M. (2000) Reliability, Maintenance and Logistic Support: a Life Cycle Approach,
Boston, Mass., Kluwer Academic Publishers
Markeset, T. and Kumar, U. (2001) R&M and risk analysis tools in product design to reduce life-cycle cost and improve product
attractiveness, Proceedings of the Annual Reliability and Maintainability Symposium, 22-25 January, Philadelphia, 116-122
Markeset, T. and Kumar, U. (2003) Integration of RAMS information in design processes: a case study, Proceedings of the 2003 Annual
Reliability and Maintainability Symposium, Tampa, FL, 20-24 January
Markeset, T. and Kumar, U. (2003a) Design and development of product support and maintenance concepts for industrial systems, J. of
Quality in Maintenance Engineering, 9 (4), 2003, 376-392
Rasche, T. F. and Wooley, K. (2000) Importance of risk based integrity management in your safety management system: advanced
methodologies and practical examples, In Queensland Mining Industry Health & Safety Conference 2000, Townsville: Queensland
Mining Council
Sheikh, A. K., Younas, M. and Raouf, A. (2000) Reliability based spare parts forecasting and procurement strategies, Maintenance,
Modeling and Optimization, 81-108, Boston, Mass., Kluwer Academic Publishers
Smith, C. and Knezevic, J. (1996) Achieving quality through supportability: Part 1: Concepts and principles, J. of Quality in
Maintenance Engineering, 2 (2), 21-29
Appendix
Figure 1. Risk management process
Figure 2. Risk Analysis Options [Source: Rasche & Wooley (2000)]
Figure 3. Logic diagram for event tree analysis
Figure 4. Event tree analysis for the risk of ignoring the product operating-environment factor in spare parts planning
Figure 5. Partial fault tree analysis
Modern maintenance system based on web and mobile technologies
Jaime Campos1, Erkki Jantunen2, Om Prakash3
1. School of Technology and Design, Växjö University, SE-351 95 Växjö, Sweden.
2. Senior Research Scientist, D.Sc. (Tech.), VTT Technical Research Centre of Finland, P.O. Box 1000, FI-02044 VTT, Finland.
3. Associate Professor, School of Technology and Design, Växjö University, SE-351 95 Växjö, Sweden.
[email protected], [email protected], [email protected]
Abstract: The paper illustrates the development of an e-monitoring and e-maintenance architecture and system based on web and mobile device, i.e. PDA, technologies to access and report maintenance tasks. The rarity of experts led to the application of artificial intelligence and, later, distributed artificial intelligence for condition monitoring and the diagnosis of machine condition. Recently, web technology and wireless communication have emerged as an alternative that provides maintenance with a powerful decision support tool, making it possible to have all the necessary information wherever it is needed for maintenance analysis and its various tasks. The paper goes through the characteristics of using web and mobile devices for condition monitoring and maintenance. It illustrates the ICT used to communicate among the different layers in the architecture/system and its various client machines. The practical examples are related to the maintenance of rotating machinery, more specifically, diagnosing rolling element bearing faults.
Keywords: Condition monitoring, condition based maintenance, web application, mobile application, mobile
device, PDA, database architecture
1. Introduction
Condition based maintenance is based on condition monitoring, which involves the acquisition of data and the processing, analysis and interpretation of, and extraction of useful information from, those data. It provides the maintenance personnel with the resources needed to identify a deviation from predetermined values. In the case of a deviation, a diagnosis is normally made to determine its cause. Finally, a decision regarding when and what maintenance tasks are to be performed is taken. A prognosis is made to foresee a failure as early as possible and to be able to plan the maintenance task in advance (Jantunen (2003)). The decision support systems that have been used to help maintenance departments address this matter have changed and developed over time. In the 1980s expert systems were used, and in the 1990s various techniques like neural networks and fuzzy logic were used in condition monitoring (Wang (2003) and Warwick et al. (1997)). Distributed artificial intelligence has also been used in condition monitoring since the advent of the Internet in the late 1990s (Rao et al. (1996), Rao et al. (1998a), Rao et al. (1998b) and Reichard et al. (2000)). More recently, web technology and agent technology have started to appear in maintenance and condition monitoring; the first review on the subject appeared in 2006 (Campos & Prakash (2006)). These technologies gained wider acceptance because of the agents' capability to operate in distributed open environments like the Internet or a corporate Intranet and to access heterogeneous and geographically distributed databases and information sources (Feng et al. (2004), Sycara (1998)). Recently, the combination of web technology and wireless communication has come up as an alternative for providing maintenance personnel with the right information on time, wherever it is needed for maintenance analysis and its various tasks. This paper proposes an e-maintenance, i.e. web and mobile device, architecture for maintenance and condition monitoring purposes.
2. The Web and Mobile architecture
The web technology, i.e. the Internet and Intranet, is continuously evolving and offers various techniques for utilising the application software that runs on the net. An Intranet uses Web technology to create and share knowledge within an enterprise only. The Web consists of applications that are developed in different programming languages such as Hyper Text Markup Language (HTML), Dynamic Hyper Text Markup Language (DHTML), Extensible Markup Language (XML), Active Server Pages (ASP), Java Server Pages (JSP) and Java Database Connectivity (JDBC), etc. The protocols that normally dominate the communication between the Web and its various actors are the Hypertext Transfer Protocol (HTTP) and the Transmission Control Protocol/Internet Protocol (TCP/IP). Recently, Web services (WS) have started to appear in Web applications. They also use HTTP to send and receive content messages.
Figure 1 illustrates the proposed Web and Mobile architecture. On the left there are rotating machines. Next is the proposed three-tier web and mobile architecture system. Each tier has its own specific task. The database servers store the data entering the system. They provide data and information to the middleware and to
various client machines. The user interacts with the system through the client machines, i.e. computers and
mobile devices.
Figure 1. The Web and Mobile Architecture
The middleware consists of the application/Web services and the Web server. The Web servers are the computers connected to the Internet or Intranet and acting as the server machines. WS are the application software designed to support interoperability among distributed applications over a network (World Wide Web Consortium (W3C), www.w3.org). WS facilitate the conveying of messages from and to the client machines. The potential of WS is that they can be consumed through the Web by any application program, independent of the language used. They consist of three basic components (Newcomer (2002) and Meyne & Davis (2002)). The first is XML, a language that is used across the various layers in the web services. The second is the SOAP listener, which handles the packaging, sending and receiving of messages over HTTP. The third component is the Web Services Description Language (WSDL), the code that the client machine uses to read the messages it receives. WS development can be done with many programming languages, such as those from Sun (Java) or Microsoft. Another important component in WS is the repository for the Universal Description, Discovery and Integration (UDDI) protocol. The UDDI provides a standard platform that allows various applications to find, access and consume WS over the internet (www.uddi.org).
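For concreteness, a SOAP request over HTTP might look as follows; this Python sketch uses an invented endpoint, namespace and operation name, and does not reproduce the actual service interface of the system described here.

```python
import urllib.request

# A minimal SOAP 1.1 envelope asking a hypothetical condition monitoring
# service for the latest RMS reading; all names below are illustrative.
envelope = """<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <GetLatestRms xmlns="http://example.org/cbm">
      <machineId>PUMP-01</machineId>
    </GetLatestRms>
  </soap:Body>
</soap:Envelope>"""

request = urllib.request.Request(
    "http://example.org/cbm/MonitoringService",   # hypothetical endpoint
    data=envelope.encode("utf-8"),
    headers={"Content-Type": "text/xml; charset=utf-8",
             "SOAPAction": "http://example.org/cbm/GetLatestRms"},
)
with urllib.request.urlopen(request) as response:  # returns a SOAP reply
    print(response.read().decode("utf-8"))
```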
3. The data and the system architecture
Databases are characterised by various factors, such as their ability to provide long-term reliable data storage, multi-user access, concurrency control, query, recovery, and security capabilities. These factors are important in maintenance because of the need, for example, to gather and store data for the purpose of monitoring the machines' health. Database technologies have been changing over time, and a review is available in Du & Wolfe (1997). The review goes through database architectures such as relational databases, semantic data
modelling, distributed database systems, object oriented database and active databases. They mention that the
most used is the relational database architecture. It has high performance when simple data requirements are
involved and it has been widely accepted. However, other database architectures may be needed when complex
data is used. The OSA-CBM (Open System Architecture for Condition Based Maintenance) and MIMOSA (Machinery Information Management Open Systems Alliance) are two organisations which have been active in developing standards for information exchange and communication among different modules for CBM [Thurston (2001), www.mimosa.org]. The OSA-CBM has been partly funded by the navy through a DUST (Dual Use Science and Technology) program [Thurston (2001), www.osacbm.org]. There were various
participants from industrial, commercial and military applications of CBM technology such as Boeing,
Caterpillar, Rockwell Automation, Rockwell Science Center, Newport News, and Oceana Sensor Technologies.
MIMOSA has developed a Common Relational Information Schema (CRIS). It is a relational database model for the different data types that need to be processed in CBM applications. The system interfaces have been defined according to the database schema based on CRIS. The interface definitions developed by MIMOSA are an open data exchange convention to use for data sharing in today's CBM systems. Another important contribution in this area is the ISO 17359 standard, which specifies the reference values to consider when a condition monitoring programme is implemented, for example standards for vibration monitoring and analysis. These were taken into consideration while developing the system.
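As a toy illustration of the relational approach (not the actual MIMOSA CRIS schema, which is far more extensive), a measurement table for condition monitoring data could be sketched with SQLite in Python; all table and column names below are invented.

```python
import sqlite3

# Minimal relational schema for condition monitoring measurements;
# it only hints at the idea behind a CRIS-like data model.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE measurement (
                   machine_id  TEXT,
                   measured_at TEXT,   -- ISO 8601 timestamp
                   parameter   TEXT,   -- e.g. 'rms_velocity'
                   value       REAL,
                   unit        TEXT)""")
con.execute("INSERT INTO measurement VALUES "
            "('PUMP-01', '2007-06-01T10:00', 'rms_velocity', 2.8, 'mm/s')")
for row in con.execute("SELECT * FROM measurement WHERE machine_id = 'PUMP-01'"):
    print(row)
```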
4. Development of the System
The system used three information and communication technologies (ICT): web services, the web server and remote access for the communication between the database servers and client machines (Fig. 2). The database servers in the system can also be accessed directly and remotely by mobile devices via wireless communication. Various communication protocols with different characteristics are available for the wireless communication between the client machines and the objects in the system. Mobile devices normally have narrow bandwidth; if interaction with the servers is too frequent, the network becomes heavily loaded and slows down. This problem was partially overcome in the development process through the use of multiple forms on a single mobile page.
Figure 2. ICT in the architecture
The system was then tested with a simulated signal from a rolling element bearing. The data flow and the various processes involved are illustrated in Fig. 3. In this work the OSA-CBM, the MIMOSA CRIS data structure and the ISO 17359 standards were taken into consideration.
Figure 3. The data flow and its various processes.
In Fig. 3, the sensor data is gathered from the various sensors in the machine and is then stored in the database, more specifically in the data acquisition layer. From the data acquisition layer, the relevant time data are sent to the next layer, where the signal analysis takes place. The results of the signal analysis, expressed as a set of parameters, are compared with condition monitoring standards in the condition monitoring layer. Finally, a diagnosis is made and a decision is taken. The results of the diagnosis are then displayed. Figs. 4 to 6 below show various outputs in the mobile device emulator's windows from the Web and Mobile architecture. The first mobile window, Fig. 4, illustrates the vibration velocity, RMS values, in mm/s vs. date.
Figure 4. RMS chart.
Figure 5 shows the vibration velocity (RMS values in mm/s) in the time domain, and Fig. 6 in the frequency domain.
Figure 5. Time data.
Figure 6. Spectrum.
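As a rough illustration of the kind of calculation performed in the signal analysis layer, the following Python sketch computes the RMS value of a simulated vibration velocity signal; the sampling rate, frequency and amplitude are invented and do not reproduce the simulated bearing signal used in the paper.

```python
import math

def rms(signal):
    """Root-mean-square value of a sampled vibration velocity signal."""
    return math.sqrt(sum(v * v for v in signal) / len(signal))

# Simulated vibration: a 25 Hz sinusoid sampled at 1 kHz for 1 s,
# amplitude 4 mm/s -- purely illustrative values.
fs, f0, amp = 1000, 25, 4.0
signal = [amp * math.sin(2 * math.pi * f0 * n / fs) for n in range(fs)]
print(round(rms(signal), 3))   # amp / sqrt(2) = 2.828 mm/s
```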
Security factors should be considered when developing applications with ICT, in this case web and mobile device applications. The factors that make web and mobile devices more vulnerable are the lack of an authentication process and the lack of secure communication (Meyne & Davis (2002)). There are ways to mitigate these vulnerabilities, for example with security policies and encryption. The security aspects, however, were not considered in the development of the system; nevertheless, they are important.
The mobile device provides the maintenance personnel with a mobile user interface to the whole e-maintenance system. The device is a relatively lightweight monitoring system with a long battery life and memory that can be used for offline work. The maintenance engineers can also, through the device, get information from other sources such as the Computerized Maintenance Management System (CMMS), to be able to make a work order or see the availability of spare parts. It also provides the possibility of accessing, if needed, the history of the machine stored in the CMMS through the Wireless Local Area Network (WLAN).
A maintenance engineer working offline with his mobile device can still have access to the relevant data available on the servers and services of the architecture. This is useful, since the mobile device has only a small memory for storing data and for further analysis, while in certain cases such data are needed to pinpoint the exact condition of the equipment. However, the data is normally located and processed on the servers and services of the architecture. The mobile device also provides the personnel with the ability to communicate with local intelligent sensors or other kinds of sensors. This is possible when the sensors are equipped with an AD-card located on the Universal Serial Bus (USB). In any case, the normal way in which the mobile device communicates with the architecture is through the WLAN and web technology such as the Web Services. Other features that the personnel can use are, for example, the calendar and word processing, which facilitate the maintenance personnel's daily work.
Conclusions
The wireless technology seems to be an important factor in future maintenance, owing to the elimination of connecting cables between the monitored machine/equipment and the monitoring systems. Experience shows that the mobile device normally requires frequent interaction with the server, which can cause the performance to decrease. For this reason it is important that the mobile internet performance is high, since user satisfaction is crucial. In the present work the performance was improved with the use of multiple forms on a single mobile page. The mobile device could also access the data using Web services. This is a useful development, as the data needed for diagnosis and prognosis are normally huge in amount, while the storage capacity of a mobile device is small. For this reason, the use of Web services for this part of the system was a good approach to take. In this way the load on the server also decreases, which helps to improve the performance of the Web and wireless communication. Finally, maintenance personnel can remotely monitor the health of equipment that may be located geographically anywhere; the capacity of the wireless network used is the only limiting factor.
Acknowledgements
This work presented is based on results from the project Dynamite. The Dynamite Project is a European
Community funded research project. The project is an Integrated Project instrument funded under the Sixth
Framework Programme.
References
Campos, J. and Prakash, O. (2006) Information and Communication Technologies in Condition Monitoring and Maintenance, in Dolgui,
A., Morel, G. and Pereira, C.E. (Eds.) Information Control Problems in Manufacturing, post-conference proceedings, 12th IFAC International Symposium, St. Etienne, France, Elsevier, Vol. II
Du, T. C.-T. and Wolfe, P. M. (1997) Overview of emerging database architectures, Computers & Industrial Engineering, 32 (4), 811-821
Feng, J. Q., Buse, D. P., Wu, Q. H. and Fitch, J. (2002) A multi-agent based intelligent monitoring system for power transformers in distributed substations, International Conference on Power System Technology Proceedings (Cat. No.02EX572), 3, 1962-1965
Jantunen, E. (2003) Prognosis of wear progress based on regression analysis of condition monitoring parameters, Tribologia - Finnish Journal of Tribology, 22
Meyne, H. and Davis, S. (2002) Developing web applications with ASP.NET and C#, Wiley Computer Publishing, John Wiley & Sons,
Inc, ISBN 0-471-12090-1
Newcomer, E. (2002) Understanding Web Services: XML, WSDL, SOAP, and UDDI, Addison Wesley Professional, ISBN: 0-201-75081-3
Rao, M., Yang, H. and Yang, H. (1996) Integrated distributed intelligent system for incident reporting in DMI pulp mill. Success and
Failures of Knowledge-Based Systems in Real-World Applications, Proceedings of the First International Conference. BKK'96,
1996, 169-178
Rao, M., Zhou, J. and Yang, H. (1998a) Architecture of integrated distributed intelligent multimedia system for on-line real-time process monitoring, SMC'98 Conference Proceedings, 1998 IEEE International Conference on Systems, Man, and Cybernetics (Cat. No.98CH36218), 2, 1411-1416
Rao, M., Yang, H. and Yang, H. (1998b) Integrated distributed intelligent system architecture for incidents monitoring and diagnosis,
Computers in Industry, 37, 143-145
Reichard, K. M., Van Dyke, M. and Maynard, K. (2000) Application of sensor fusion and signal classification techniques in a distributed
machinery condition monitoring system, Proceedings of SPIE - The International Society for Optical Engineering, 4051, 329-336
Sycara, K. P. (1998) MultiAgent Systems, AI Magazine, 19 (2)
Thurston, M. G. (2001) An open standard for Web-based condition-based maintenance systems, 2001 IEEE Autotestcon Proceedings.
IEEE Systems Readiness Technology Conference, 2001, 401-415
Wang, K. (2003) Intelligent Condition Monitoring and Diagnosis System: A Computational Intelligence Approach, Frontiers in Artificial Intelligence and Applications, 93, ISBN 1-58603-312-3, pp 132
Warwick, K., Ekwue, A. O. and Aggarwal, R. (Eds.) (1997) Artificial Intelligence Techniques in Power Systems, Power & Energy,
Publishing & Inspec, ISBN: 0 85296 897 3
A literature review of computerised maintenance management support
Mirka Kans
School of Technology and Design, Department of Terotechnology, Växjö University, S-351 95 Luckligs plats 1,
Sweden
[email protected]
Abstract: Maintenance management information technology (MMIT) systems have existed for some forty years. This paper investigates the advancement of these systems and compares the development of MMIT with that of other corporate information technology (IT) systems by means of a literature study of 97 scientific papers on the topic of MMIT and additional readings in books. The study reveals that the focus of MMIT has changed in several respects during the forty years investigated: from technology to use, from the maintenance function to maintenance as an integrated part of the business, from supporting reactive maintenance to supporting proactive maintenance, and from operative to strategic maintenance considerations.
1. Introduction
Research shows that information technology (IT) investments correlate positively with companies' profitability and competitiveness, and thus that IT has strategic importance; see, for instance, Johnsson (1999), Kini (2002) and Dedrick et al. (2003). Information technology systems have been in use in companies for some 40 years and are today a natural tool for many workers. IT systems for maintenance purposes have existed approximately as long as computers have been available for commercial use. Even so, has the development of maintenance management information technology (MMIT) kept pace with the general development of corporate IT? And in what way has MMIT made advances during its forty years of existence? These questions will be investigated using the literature as a basis. Despite reviewing the literature about MMIT several times, the author has not yet found a literature study describing the development of MMIT. To fill this gap, we present in this paper a literature review of the topic of MMIT.
To be able to understand the development of maintenance management IT, we will first look at the general computerisation of companies. The next section presents three main phases within corporate information technology development: the Introduction, the Coordination and the Integration phase. The phases can be compared to the six stages of IT growth and maturity presented in Nolan (1979), see Figure 1, where the first two stages, Initiation and Contagion, are similar to the Introduction phase. Here technology and functional automation are stressed. In stages three and four, Control and Integration, the top management gains control over the IT resources and the IT resources support the overall business strategy, i.e. the coordination of business activities, as in the Coordination phase. The last two stages in Nolan's IT maturity model, Data administration and Maturity, deal with data sharing and information systems as a strategic matter. These stages are similar to the Integration phase.
[Figure 1 plots the relative level of planning and control in installations and of IT expenditures against Nolan's six stages of growth and maturity (Stage 1 Initiation, Stage 2 Contagion, Stage 3 Control, Stage 4 Integration, Stage 5 Data administration, Stage 6 Maturity), marks the transition point from computer management to data resources management, and maps the stages to the three main phases within corporate IT development (Phase 1 Introduction, Phase 2 Coordination, Phase 3 Integration).]
Figure 1. Corporate IT development
A literature survey covering maintenance information technology was conducted in autumn 2003 / spring 2004. The databases used for the survey were Elsevier, Emerald, and IEEE. The following combinations of keywords were used: decision support system, expert system, computerised and information system, combined with maintenance, asset management or maintenance management system. An additional search was made in a full text database search tool (ELIN) that integrates a vast number of databases, e.g. Elsevier, Emerald, IEEE, Proquest and Springer, using the same keywords as above. A total of 97 articles within the relevant topic were found in this survey. All the articles were published in the period 1988 to 2003. Additional reading was done in books about maintenance and computerised maintenance management systems, especially to capture the missing period 1960-1988. The number of articles per year is presented in Table 1. The historical description is divided into three periods: 1960-1992, 1993-1998 and 1999-2003. The number of articles from each period is also found in Table 1. The periods represent different stages of maintenance information technology maturity and are consistent with the three phases of corporate IT development: Introduction, Coordination and Integration.
Year    Number of articles          Period       Number of articles
1988    2                           1988-1992    21
1989    5                           1993-1998    40
1990    5                           1999-2003    36
1991    5                           Total        97
1992    4
1993    7
1994    5
1995    7
1996    9
1997    6
1998    6
1999    4
2000    6
2001    9
2002    6
2003    11
Total   97

Table 1. Number of articles per year
2. Information technology systems within companies: a historical perspective
A historical review of general corporate IT is given in the following based upon the model presented in the
previous section.
• Introduction: The emergence of corporate information technology
The use of IT emerged in administration mainly to automate information processing, Persson et al. (1981) and Dahlbom (1997). The first computers, the mainframes, filled a whole room, and a special crew fed the mainframe with input data and programs for e.g. sorting, listing and analysing payrolls, customer registers, supplier registers and so on. Computing time was expensive, and the most crucial calculations and analyses were prioritised. These computers were for the larger companies to use, while the medium and small sized companies still had to rely on manual information handling. When the minicomputers emerged in the 70s, computer power became accessible for "everyone", Persson et al. (1981).
The IT systems that were developed for minis contained department-specific applications. The financial department ran spreadsheet and ledger applications, the management ran word processors and market analysis applications, while the personnel administration computed payments and kept a payroll application. Each application structured the information in its own way, and data were stored in file systems. Cross-functional access to data and information was not supported. Neither was the management able to get reports including information from more than one area of the business, unless special applications were bought or developed for this purpose; see for instance Mullin (1989).
➢ Coordination: Connecting the systems together
In the late 70s and 80s the use of middleware, devices that act like a translator between two systems each talking its own language, enabled communication without actual changes in the specific systems, see e.g. Tuunainen (1998). With the ability to interchange information between IT systems, the possibilities to coordinate the organisation grew, at both the horizontal and the hierarchical level.
The 80s are characterised by incorporation efforts. Now all separate IT systems were to be incorporated into a unity, and this resulted in compatibility problems for the technical devices. There were vast numbers of hardware and software standards, sometimes one for each vendor. Even different versions of the same application were sometimes incompatible. These compatibility problems were slowly overcome by standardising hardware and software, see for instance Adler (1995) and Hosaka et al. (1981). On the data level there were problems regarding syntax heterogeneity (the structure of the information), semantic heterogeneity (the content of the information) and pragmatic heterogeneity (how the information is handled, e.g. concurrency control), Toussaint et al. (2001). The data heterogeneity problems were overcome for instance by using relational databases as data repositories, Hoffer et al. (2005). In the early coordination phase, when middleware was used, only predefined data and information could be interchanged. With incorporated IT systems and central databases, data and information were accessible to everyone using the IT system, Dahlbom (1997) and Kelly et al. (1997).
➢ Integration: One company-wide solution
In the last part of the 90s, total business solutions emerged. These solutions integrate databases and functionality as well as providing a common user interface. In industry, enterprise resource planning (ERP) systems and industrial automation such as computer integrated manufacturing (CIM) and supervisory control and data acquisition (SCADA) systems had, and still have, a central role in bringing total integration, Gordon (2002) and Nikolopoulos et al. (2003). CIM and ERP systems integrate “everything”, CIM at the production control level and ERP at the company administrative level. Until recently, these systems have been separate, but in the field the trend is now moving towards integration of these technical systems and administrative systems, see for example Dahlfors & Pilling (1995) and Bratthall et al. (2002).
3. The development of computerised maintenance support
Based upon the literature survey described in the introduction, the author would like to present a historical
review of the development of maintenance information technology from the 60s and until today, as it is
described in scientific literature.
➢ Introduction: The emergence of maintenance information technology
According to Wilder and Cannon (1993), computerised maintenance support did not exist before the year 1960. There were maintenance planning systems available for mainframes in the 70s, where the computation time was shared with other departments, giving high priority to the most important processing, Kelly (1984). Maintenance was most likely not one of the high-priority activities and Kelly concludes that the tasks were limited to some scheduling of preventive actions. The first maintenance IT automation step was available to the large companies and supported preventive maintenance, though to a limited extent, while other companies had to rely on manual maintenance management.
In the beginning of the 80s, minicomputers with dedicated programs were developed, giving the maintenance department greater freedom to systematise, plan and check up on maintenance activities, Wilder and Cannon (1993) and Kelly (1984). In 1985, at least 60 CMMS were available, Raouf et al. (1993). At this time, the backbone of CMMS was established, consisting of functionality for scheduling, plant inventory, stock control, cost and budgeting and maintenance history, Wilson (1984). Another popular kind of IT support was expert systems (ES) for reducing downtime when conducting reactive maintenance. About half of the papers written during the period 1988-1990 are within ES for fault detection and troubleshooting, see for example Walters (1990) and Ho et al. (1988).
Technological innovation is also discussed in the late 80s. One project conducted by the US Navy aimed at digitalising and integrating several sources of maintenance information, using ultra-modern techniques such as optical disc storage, Landers et al. (1989). Furthermore, the US Air Force demonstrated the first hand-held computer that could integrate on-site failure data with historical data and manuals in order to reach a failure diagnosis, Link (1989). The state of the art in computerised maintenance management systems of the late 80s is given by Mullin (1989), who describes the computer-aided maintenance management systems at Ford as developed independently, poorly integrated and with poor interfaces.
➢ Coordination: Structuring the maintenance IT resources
About a third of the papers studied from the first part of the 90s deal with the concept of CMMS, and words like efficiency and cost reduction occur. Ben-Bassat et al. (1993) present an expert system for cost-effective utilisation of maintenance resources. The ability to identify and follow up maintenance costs using CMMSs is discussed in Gehl (1993). Jones (1994) concludes that if a CMMS is to be cost-efficient, its introduction and use must be connected to organisational culture and to the maintenance as well as the business strategy. The aspect of easy-to-use interfaces, such as in Hall et al. (1994), where graphics are used to reach user-friendliness, is also pointed out. The word integration shows up for the first time now. Fung (1993) promotes the use of CMMS to integrate maintenance with e.g. quality assurance and energy management. Nolasco (1994) discusses CMMS and integration between maintenance, purchasing and accounting, and Sherwin & Jonsson (1995) promote the use of management information systems to integrate maintenance and production. MMIT is thus apprehended as a useful resource; the work of connecting different maintenance applications begins, and the connections between maintenance and other working areas are explored.
The main focus of the 90s lies, though, in how to manage preventive maintenance, for instance in the shape of expert systems for policy planning, scheduling and fault diagnosis, see e.g. Batanov et al. (1993), or IT systems for preventive maintenance management, Fung (1993), Gehl (1993) and Raouf et al. (1993). At this time, there are more than 200 commercial CMMS packages available in North America alone, Campbell (1995). The military is still at the forefront of maintenance, for instance with two projects aimed at computerised life cycle cost (LCC) analysis of weapon systems, including preventive maintenance considerations, Hansen et al. (1992) and Awtry et al. (1991). LCC simulation is also the topic of Ostebo (1993). Expert systems are still a common topic in the period, see for instance Batanov et al. (1993) and Mitchell (1991).
Papers discussing computer support for predictive maintenance also appear. Sato et al. (1992) and Itaka et al. (1991) describe an advanced system for condition monitoring and maintenance communication for power transmission lines. Wichers (1996) discusses a reliability-centred maintenance based system for maintenance planning, especially stressing condition monitoring, which is connected to a manual or computerised maintenance management system. Pearce & Hall (1994) recognise the advantages of vibration monitoring and the importance of connecting on-line monitoring data to a computerised maintenance management system. We can see that computerised support for condition monitoring is developed but not widely incorporated into the administrative IT systems.
➢ Integration: Maintenance and maintenance IT as a part of the whole company
At the end of the 90s, the economic aspect appears even stronger. Maintenance IT is discussed with respect to cost-effectiveness and cost reduction, see for example Labib (1998) and Weil (1998). Jonsson (2000) connects IT maturity in maintenance with profitability. The term integration is used to discuss integrated CMMS solutions during this period, where e.g. integration of CMMS and asset management systems is discussed, Boyles (1999) and Weil (1998), and the benefits of integrated CMMS are addressed, Panucci (2000). Zhang et al. (1997) discuss the use of artificial intelligence to achieve an integrated maintenance management system that takes into consideration not only equipment condition, but also production quality, efficiency and costs. The development of computerised communication methods, such as remote monitoring, telemaintenance and geographical information systems, also affects the topics of papers; see e.g. Hadzilacos et al. (2000) and Laugier et al. (1996).
The topic of decision support systems has grown continuously during the studied years. In the period 2001-2003, decision support systems are discussed in eight of twenty-six papers, i.e. about one third of the papers (compared with the period 1988-1990, when the figure was two out of twelve papers). Yam et al. (2001), for instance, discuss operational and maintenance cost reduction as the result of more accurate condition-based fault prediction and diagnosis reached by decision support systems. Other examples of IT support for diagnosis and prognosis are found in Yagi et al. (2003) and Wang et al. (2002). It is also noticeable that papers about expert systems have decreased from about 40% (five of twelve papers) in 1988-1990 to about 20% (five of twenty-six papers) in 2001-2003.
4. Conclusions
The survey of computerised maintenance support reveals that the focus of MMIT has changed in four respects during the forty years that have been investigated: 1) from technology to use, 2) from maintenance function to business integration, 3) from reactive maintenance to proactive maintenance and 4) from operative to strategic maintenance considerations. These shifts in focus are further discussed below.
➢ Technology → use
In the microcomputer era, automation of routines was in focus. The main benefits of maintenance IT lay in reducing manual paperwork and getting a grip on maintenance-specific resources. IT in enterprises was a new phenomenon and the technology itself was stressed in the early papers. As the IT maturity of enterprises grew, the technological construct of maintenance IT was discussed less often. Instead, the focus shifted to the use of IT. MMIT is in the later papers treated as a tool, which can benefit the user if used properly, and the actual benefits are stressed.
➢ Maintenance function → business integration
While the literature in the early years considers the maintenance function and its information technology needs, an increased use of the integration concept is seen in later papers. By using, and by integrating, CMMS, advantages in maintenance could be achieved.
➢ Reactive maintenance → predictive-proactive maintenance
A trend of increasing IT support for maintenance management activities appears in the description, from mainly supporting technical reactive and preventive maintenance strategies in the microcomputer era to predictive condition-based strategies once different corporate IT resources could be integrated. Today, as predictive-proactive maintenance strategies, which help to avoid damage initiation by detecting the damage causes, are strongly gaining ground, we should be able to see this reflected in contemporary research. The growth in the number of papers published in recent years discussing integration and DSS for maintenance could be an indication of this. Furthermore, the discussion about the financial benefits of maintenance, and about the connection between maintenance and production performance together with IT, implies a more holistic view of the maintenance role in companies. Having a holistic perspective on maintenance enables predictive-proactive maintenance.
➢ Operative maintenance considerations → strategic maintenance considerations
A shift in focus from operative maintenance concerns to strategic maintenance concerns could be seen in the study. Notable is e.g. the increased number of papers in the later years dealing with the economic advantages that could be reached by using CMMS, whereas the focus in the early years was on describing how the operative maintenance work could be speeded up and automated by using computers.
References
Adler, R. M. (1995) Emerging standards for component software, Computer, 28 (3), 68-77
Al-Najjar, B. (1996) Total quality maintenance: An approach for continuous reduction in costs of quality products, J. of Quality in Maintenance Engineering, 2 (3), 4-20
Alsyouf, I. (2004) Cost effective maintenance for competitive advantages, PhD thesis. Växjö University, School of Industrial Engineering
Awtry, M. H., Calvo, A. B. and Debeljak, C. J. (1991) Logistics engineering workstations for concurrent engineering applications, Proceedings of the IEEE 1991 National Aerospace and Electronics Conference, NAECON 1991, 3, 1253-1259
Batanov, D., Nagarur, N. and Nitikhunkasem, P. (1993) EXPERT-MM: A knowledge-based system for maintenance management,
Artificial Intelligence in Engineering, 8 (4), 283-291
Ben-Bassat, M., Beniaminy, I., Eshel, M., Feldman, B. and Shpiro, A. (1993) Workflow management combined with diagnostic and
repair expert system tools for maintenance operations, AUTOTESTCON '93. IEEE Systems Readiness Technology Conference
Proceedings, 367-375
Boyles, C. (1999) CMMS and return on assets, Chemical Processing, 62 (5), 62-65
Bratthall, L. G., van der Geest, R., Hofmann, H., Jellum, E., Korendo, Z., Martinez, R., Orkisz, M., Zeidler, C. and Andersson, J.S. (2002) Integrating hundreds of products through one architecture - the industrial IT architecture, Proceedings of the 24th International Conference on Software Engineering, 2002, 604-614
Campbell, J. D. (1995) Outsourcing in maintenance management: A valid alternative to self-provision, J. of Quality in Maintenance
Engineering, 1 (3), 18-24
Dahlbom, Bo. (1997) The New Informatics, Scandinavian J. of Information Systems, 8 (2), 29-48
Dahlfors, F. and Pilling, J. (1995) Integrated information systems in a privatized and deregulated market, International Conference on Energy Management and Power Delivery, 1995. Proceedings of EMPD '95, 1, 249-254
Dedrick, J., Gurbaxani, V. and Kraemer, K. L. (2003) Information Technology and Economic Performance: A Critical Review of the Empirical Evidence, ACM Computing Surveys, 35 (1), 1-28
Fung, W. Y. (1993) Computerized maintenance management system in a railway-based building services unit, ASHRAE Transactions, 99
(1), 72-83
Gehl, P. (1993) Management application of CMMS reports, Advances in Instrumentation and Control: International Conference and
Exhibition, 48 (3), 1535-1556
Gordon, L. A. (2002) The e-skip-gen effect. The emergence of a cybercentric management model and the F2B market segment for industry, International J. of Production Economics, 80 (1), 11-29
Hadzilacos, T., Kalles, D., Preston, N., Melbourne, P., Camarinopoulos, L., Eimermacher, M., Kallidromitis, V., Frondistou-Yannas, S.S.
and Saegrov, S. (2000) UtilNets: a water mains rehabilitation decision-support system, Computers, Environment and Urban
Systems, 24 (3), 215-232
Hall, J. D., Biles, W. E. and Leach, J. (1994) An autocad-12 based maintenance management system for manufacturing, Computers
industrial engineering, 29 (1-4), 285-289
Hansen, W. A., Edson B. N. and Larter, P. C. (1992) Reliability, availability and maintainability expert systems (RAMES), Annual
Reliability and Maintainability Symposium, 285-289
Ho, T.-L., Bayles, R. A. and Havlicsek, B. L. (1988) A diagnostic expert system for aircraft generator control unit (GCU), Proceedings
of the IEEE 1988 National Aerospace and Electronics Conference, 1988. NAECON 1988, 4, 1355-1362
Hoffer, J., Prescott, M., McFadden, F. R. (2005) Modern Database Management, Upper Saddle River, Pearson/Prentice Hall.
Hosaka, T., Ueda, K. and Matsuura, H. (1981) A Design Automation System for Electronic Switching Systems, 18th Conference on Design Automation, 29 June - 1 July 1981, 51-58
Itaka, K., Matsubara, I., Nakano, T., Sakurai, K. and Taga H. (1991) Advanced maintenance information systems for overhead power
transmission lines, APSCOM-91., 1991 International Conference on Advances in Power System Control, Operation and
Management, 2, 927-932
Jones, R. (1994) Computer-aided maintenance management systems, Computing & Control Engineering J., 5 (4), 189-192
Jonsson, P. (1999) The Impact of Maintenance on the Production Process-Achieving High Performance, Lund University, Institute of
Technology ,Department of Industrial Engineering, Division of Production Management
Jonsson, P. (2000) Towards a holistic understanding of disruptions in Operations Management, J. of Operations Management, 18, 701-718
Kelly, A. (1984) Maintenance planning and control, Butterworths, London
Kelly, G. J., Aouad, G., Rezgui, Y. and Crofts, J. (1997) Information systems development in the UK construction industry, Automation
in Construction, 6, 17-22
Kini, R. B. (2002) IT in manufacturing for performance: the challenge for Thai manufacturers, Information Management & Computer Security, 10 (1), 41-48
Labib, A. W. (1998) World-class maintenance using a computerised maintenance management system, J. of Quality in Maintenance
Engineering, 4 (1), 66-75
Landers, T., Nguyen, M. and Delgado, R. (1989) A digital maintenance information (DMI) system for ATE, AUTOTESTCON '89. IEEE
Automatic Testing Conference. The Systems Readiness Technology Conference. Automatic Testing in the Next Decade and the 21st
Century. Conference Record, 272-276
Laugier, A., Allahwerdi, N., Baudin, J., Gaffney, P., Grimson, W., Groth, T. and Schilders, L. (1996) Remote instrument
telemaintenance, Computer Methods and Programs in Biomedicine, 50 (2), 187-194
Link, W. R. (1989) The IMIS F-16 interactive diagnostic demonstration, Proceedings of the IEEE 1989 National Aerospace and
Electronics Conference, NAECON 1989, 3, 1359-1362
Mitchell, J. (1991) Research into a sensor-based diagnostic maintenance expert system for the hydraulics of a continuous mining
machine, Conference Record of the 1991 IEEE Industry Applications Society Annual Meeting, 2, 1192-1199
Mullin, A. (1989) The application of information technology to asset preservation in Ford of Europe, IEE Colloquium on IT in the
Management of Maintenance, 2/1-2/4
Nikolopoulos, K., Metaxiotis, K., Lekatis, N. and Assimakopoulos, V. (2003) Integrating industrial maintenance strategy into ERP, Industrial Management & Data Systems, 103 (3), 184-191
Nolan, R. L. (1979) Managing the crises in data processing, Harvard Business Review, 57 (2), 115-126
Nolasco, A. (1994) Computerized maintenance management systems (CMMS) in cement plants, World Cement, 25 (12), 44-48
Ostebo, R. (1993) System-effectiveness assessment in offshore field development using life-cycle performance simulation, Proceedings
of the Annual Reliability and Maintainability Symposium, 1993, 375-385
Panucci, D. (2000) Take CMMS seriously, Manufacturing computer solutions, 6 (5), 25
Pearce D. F. and Hall, S. (1994) Using vibration monitoring techniques to minimize plant downtime, IEE, 8/1-8/2
Persson, P. O., Boberg, K-E., Broms, I., Docherty, P., Kraulis, G. and Kreimer, B. (1981) 80-talet på en bricka. Datateknik; utveckling
och miljö 1980-1990 (The 80s on a tray. Computer technology; development and environment 1980-1990), Riksdataförbundet,
Stockholm
Raouf, A., Ali, Z. and Duffuaa, S. O. (1993) Evaluating a Computerized Maintenance Management System, International J. of
Operations & Production Management, 13 (3), 38-49
Sato, K., Atsumi, S., Shibata, A. and Kanemaru, K. (1992) Power transmission line maintenance information system for Hokusei line
with snow accretion monitoring capability, IEEE Transactions on Power Delivery, 7 (2), 946-951
Sherwin, D. J. and Jonsson, P. (1995) TQM, maintenance and plant availability, J. of Quality in Maintenance Engineering, 1 (1), 15-19
Toussaint, P. J., Bakker, A. R. and Groenewegen, L. P. J. (2001) Integration of information systems: assessing its quality, Computer
Methods and Programs in Biomedicine, 64, 9-35
Tuunainen, V. K. (1998) Opportunities of effective integration of EDI for small businesses in the automotive industry, Information and
Management, 34 (6), 361-375
Walters, M. D. (1990) Inductive learning applied to diagnostics, AUTOTESTCON '90. IEEE Systems Readiness Technology Conference.
'Advancing Mission Accomplishment', Conference Record, 167-174
Weil, M. (1998) Raising the bar for maintenance apps, Manufacturing Systems, 16 (11), 5
Wichers, J. H. (1996) Optimising maintenance functions by ensuring effective management of your computerised maintenance
management system, IEEE AFRICON 4th AFRICON, 2, 788-794
Wilder P. and Cannon M. (1993) Advantages of a computerized maintenance management system in managing plant operations, Textile, Fiber and Film IEEE 1993 Annual Industry Technical Conference, 5/1-5/12
Wilson, A. (1984) Planning For Computerised Maintenance, Conference Communication, UK
Yagi, Y., Kishi, H., Hagihara, R., Tanaka, T., Kozuma, S., Ishida, T., Waki, M., Tanaka, M. and Kiyama, S. (2003) Diagnostic
technology and an expert system for photovoltaic systems using the learning method, Solar Energy Materials and Solar Cells, 75
(3-4), 655-663
Yam, R. C. M., Tse, P. W., Li, L. and Tu, P. (2001) Intelligent Predictive Decision Support System for Condition-Based Maintenance,
International J. of Advanced Manufacturing Technology, 17. (5), 383-391
Zhang, J., Tu, Y. and Yeung, E. H. H. (1997) Intelligent decision support system for equipment diagnosis and maintenance management,
Innovation in Technology Management - The Key to Global Leadership. PICMET '97: Portland International Conference on
Management and Technology, 733
Wang Z., Guo, J., Xie, J. and Tang, G. (2002) An introduction of a condition monitoring system of electrical equipment, Proceedings of
2001 International Symposium on Electrical Insulating Materials, (ISEIM 2001), 221-224
Some generalizations of age and block replacement
Phil Scarf
Centre for OR and Applied Statistics, Salford Business School, University of Salford, Manchester, M5 4WT, UK
[email protected]
Abstract: Consider a component that is subject to failure and preventive replacement. The simplest preventive replacement policies replace the component either when the component's age reaches a critical limit (age-based replacement) or at regular intervals of calendar time (block replacement). Various criteria may be used to quantify the properties of these preventive replacement policies; these are principally cost (cost per unit time) and reliability (distribution of the time between failures of the operational function of the component). We show how these criteria are related and that a value of one implies a value of the other. Furthermore, while it is generally accepted that age-based replacement is cost-efficient with respect to block replacement, there are circumstances under which it is not. These circumstances are investigated. The ideas are illustrated using a case study relating to traction motor replacement.
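The cost criterion named in the abstract is the standard one for this policy family; for orientation, the sketch below evaluates the classical cost-per-unit-time objective for age-based replacement under an assumed Weibull lifetime. It is a minimal illustration, not the paper's model, and all parameter values are assumptions.

```python
# A minimal sketch (not the paper's model) of the classical cost-per-unit-time
# criterion for age-based replacement with a Weibull lifetime: replace
# preventively at age T (cost c_p) or on failure (cost c_f > c_p).
# All parameter values below are illustrative assumptions.
import numpy as np
from scipy.integrate import quad

beta, eta = 3.0, 100.0          # assumed Weibull shape and scale
c_p, c_f = 1.0, 5.0             # assumed preventive / failure replacement costs

def reliability(t):
    return np.exp(-(t / eta) ** beta)

def cost_rate(T):
    """Expected cost per unit time of age-based replacement at age T."""
    F_T = 1.0 - reliability(T)                 # probability of failure before T
    mean_cycle, _ = quad(reliability, 0.0, T)  # expected cycle length
    return (c_p * reliability(T) + c_f * F_T) / mean_cycle

# crude grid search for the cost-optimal replacement age
ages = np.linspace(10, 200, 400)
best = min(ages, key=cost_rate)
print(f"optimal age ~ {best:.1f}, cost rate ~ {cost_rate(best):.4f}")
```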
Scheduling the imperfect preventive maintenance policy for deteriorating systems
Yunxian Jia *, Xuhua Chen
Department of Management Engineering, Shijiazhuang Mechanical Engineering College, 97 Hepingxi Road,
Shijiazhuang 050003, P.R. China
[email protected]
Abstract: Imperfect preventive maintenance can reduce the wear-out and aging effects of deteriorating systems to a certain level between the conditions of as good as new and as bad as old. A hybrid hazards rate recursion rule, based on the concepts of the age reduction factor and the hazards rate increase factor, is built up to measure the extent of restoration of deteriorating systems and to predict the evolution of the system reliability in different maintenance cycles. After obtaining the parameters of the hybrid hazards rate, two different situations are considered when optimizing the imperfect preventive maintenance policy. In these situations, whenever the system reliability reaches a threshold R, which is determined by minimizing the cumulative maintenance cost per unit time in the life cycle of the system, imperfect preventive maintenance is performed on the system. Finally, a numerical example is presented to validate the methods discussed and some conclusions are provided.
Keywords: imperfect maintenance, preventive maintenance, hybrid hazards rate, reliability, cost optimisation
1. Introduction
Maintenance optimization has been a popular issue for researchers since the early 1960s, and a lot of optimal maintenance strategies have been developed and implemented for improving system availability, preventing system failure risk and reducing maintenance costs, Zhou et al. (2006).
According to the degree of maintenance, Pham & Wang (1996) subdivided maintenance work into five different types: perfect maintenance, minimal repair, imperfect maintenance, worse maintenance and worst maintenance. Perfect maintenance is assumed to restore the system to the condition of as good as new, which means that the restored system or equipment has the same hazards rate function and reliability function as a new one. Minimal repair is assumed to repair the equipment to the condition of as bad as old, which means that this work merely eliminates the failure and the repaired system has the same properties as it did before the failure. Imperfect maintenance can reduce the wear-out and aging effects of deteriorating systems to a certain level between the conditions of as good as new and as bad as old. In fact, preventive maintenance is generally imperfect and cannot restore the system to as good as new in most situations.
Under the imperfect preventive maintenance policy, the system is maintained at a decreasing sequence of intervals, which is more practical since most systems need more frequent maintenance with increased usage and age. There are several methods to model an imperfect preventive action. One of the most useful methods in engineering is the improvement factor method, which is expressed in terms of the system hazards rate or other reliability measures. In the literature, two different types of improvement factors have been developed. Malik (1979) introduces the concept of the age reduction factor. If $T_i$ and $h_i(t)$ for $t \in (0, T_i)$, respectively, represent the preventive maintenance interval and the hazards rate function of the system prior to the $i$th preventive maintenance, the hazards rate function after the $i$th preventive maintenance becomes $h_i(t + a_i T_i)$ for $t \in (0, T_{i+1})$, where $0 < a_i < 1$ is the age reduction factor due to the imperfect preventive maintenance action. This implies that each imperfect preventive maintenance changes the initial hazards rate value right after the preventive maintenance to $h_i(a_i T_i)$, but not all the way to zero. Nakagawa (1988) proposes another model based on the hazards rate increase factor. The hazards rate function becomes $b_i h_i(t)$ for $t \in (0, T_{i+1})$ after the $i$th preventive maintenance, where $b_i > 1$ is the hazards rate increase factor. This indicates that each preventive maintenance resets the increase rate of the hazards rate function higher and higher.
In order to benefit from both the age reduction method, which has the advantage of determining the initial failure rate value right after a preventive maintenance, and the hazards rate increase method, which has the advantage of allowing the increase rate of the hazards rate function to be higher after each preventive maintenance, Zhou et al. (2006) proposed a hybrid hazards rate evolution rule based on these two methods. It is assumed that the relationship between the hazards rate functions before and after the $i$th preventive maintenance can be defined as

$$h_{i+1}(t) = b_i h_i(t + a_i T_i) \quad \text{for } t \in (0, T_{i+1}),$$   (1)
where $0 < a_i < 1$ and $b_i > 1$ are the age reduction factor and the hazards rate increase factor respectively, which need to be deduced from the historical maintenance data of the system. As shown in Figure 1, if $a_i = 0$, the hybrid hazards rate function reduces to that proposed by Nakagawa (1988); if $b_i = 1$, it reduces to that proposed by Malik (1979). This hybrid recursion rule makes it possible to predict the evolution of the system reliability in different maintenance cycles.
Figure 1. Hybrid hazards rate function (curves $h_i(t)$, $h_{i+1}(t) = h_i(t + a_i T_i)$, $h_{i+1}(t) = b_i h_i(t)$ and $h_{i+1}(t) = b_i h_i(t + a_i T_i)$ plotted against $t$)
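To make the recursion concrete, the following minimal sketch evolves a hazards rate function through a few maintenance cycles according to equation (1). The baseline Weibull hazard matches the numerical example of section 3, but the improvement factors and intervals used here are illustrative assumptions.

```python
# A minimal sketch of the hybrid recursion of equation (1):
# h_{i+1}(t) = b_i * h_i(t + a_i * T_i). The factors a_i, b_i and the
# intervals T_i below are illustrative assumptions, not the paper's values.

def next_hazard(h_i, a_i, b_i, T_i):
    """Shift the age by a_i*T_i (Malik) and scale the rate by b_i (Nakagawa)."""
    return lambda t: b_i * h_i(t + a_i * T_i)

# baseline Weibull hazard h_1(t) = (delta/alpha) * (t/alpha)**(delta - 1)
delta, alpha = 4.0, 120.0
h = lambda t: (delta / alpha) * (t / alpha) ** (delta - 1)

# evolve the hazard through three PM cycles with assumed (a_i, b_i, T_i)
for i, (a_i, b_i, T_i) in enumerate([(0.125, 1.15, 80.0),
                                     (0.18, 1.17, 65.0),
                                     (0.22, 1.18, 50.0)], start=1):
    h = next_hazard(h, a_i, b_i, T_i)
    print(f"initial hazard of cycle {i + 1}: h(0) = {h(0.0):.5f}")
```

Each pass prints a strictly positive initial hazard, illustrating that the restoration is partial: the rate is never reset all the way to zero.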
Aiming at minimizing the cumulative maintenance cost per unit time in the life cycle of the system, this paper applies the hybrid hazards rate recursion rule proposed by Zhou et al. (2006) to measure the extent of the imperfect preventive maintenance for deteriorating systems and to predict the evolution of the system reliability in different maintenance cycles. Considering two different situations met in practice, the authors deduce the related formulas and algorithms to optimize the maintenance policies, as described in section 2. Having obtained the optimal $R$ and $N^*$, maintenance engineers or managers can make an optimal schedule to reduce the maintenance cost. In section 3, numerical examples concerning the two situations are presented to illustrate the methods. Some results and conclusions about the methods are provided in the last section.
2. Model development
2.1. Notation and assumptions
Notation:
$i$ – ordinal of preventive maintenance cycles, $i = 1, 2, \ldots$
$N$ – preventive maintenance cycle number
$N^*$ – optimal preventive maintenance cycle number
$T_i$ – time interval for preventive maintenance prior to the $i$th maintenance
$h_i(t)$ – system hazards rate function prior to the $i$th preventive maintenance
$r_i(t)$ – system reliability function prior to the $i$th preventive maintenance
$R$ – system reliability threshold for the scheduled preventive maintenance
$c_p$ – preventive maintenance cost
$c_r$ – replacement cost
$c_f$ – cost brought by the system's failure
$E(c)$ – expected total cost for the system in the life cycle
$E(t)$ – expected operational time for the system in the life cycle
$\varphi(R, N)$ – expected cost per unit time for the system in the life cycle
In this section, two situations often met in practice will be taken into account. In the first situation, preventive maintenance actions are taken periodically on the system before it suffers a failure; if the system fails, a replacement action is performed. In other words, if and only if the system suffers no failure, imperfect preventive maintenance is performed, and a perfect maintenance (replacement) action is taken if the system fails. In the second situation, imperfect preventive maintenance or imperfect corrective maintenance is performed whenever the system reliability reaches the threshold $R$ or whenever the system fails before the scheduled preventive maintenance; when the preventive maintenance cycle number reaches $N^*$, a replacement is performed when the system reliability reaches $R$ or when the system fails. That means the optimal preventive maintenance cycle number $N^*$ and the system reliability threshold $R$ for scheduled preventive maintenance are to be decided so as to minimize the expected cost per unit time for the system in the life cycle. Compared with the operational time of the system, the duration of preventive maintenance or replacement is short enough to be ignored.
2.2. Model formulation for situation 1
In this situation, a scheduled preventive maintenance is performed whenever the system reliability reaches the
reliability threshold R, and a replacement is performed whenever the system suffers a failure. Based on this
policy, a reliability equation can be constructed as
$$\exp\left(-\int_0^{T_1} h_1(t)\,dt\right) = \exp\left(-\int_0^{T_2} h_2(t)\,dt\right) = \cdots = \exp\left(-\int_0^{T_N} h_N(t)\,dt\right) = R,$$   (2)
where $h_i(t)$ can be deduced from equation (1). Equation (2) can be rewritten as

$$\int_0^{T_1} h_1(t)\,dt = \int_0^{T_2} h_2(t)\,dt = \cdots = \int_0^{T_N} h_N(t)\,dt = -\ln R,$$   (3)

where $\int_0^{T_i} h_i(t)\,dt$ represents the cumulative failure risk in maintenance cycle $i$. This implies that the cumulative risk of system failure in each maintenance cycle is equal to $-\ln R$. Since only one preventive maintenance action is performed in one maintenance cycle, the probability of implementing a scheduled preventive maintenance action is $R$ and the probability of implementing a replacement action is $1 - R$.
For this situation, the expected total cost $E(c)$ for the system in the life cycle can be calculated as

$$E(c) = (c_r + c_f)(1 - R) + c_p R + (c_r + c_f)(1 - R)R + c_p R^2 + \cdots = (c_r + c_f) + \frac{c_p R}{1 - R},$$   (4)

and the expected operational time $E(t)$ for the system in the life cycle can be written as

$$E(t) = \int_0^{T_1} r_1(t)\,dt + R \int_0^{T_2} r_2(t)\,dt + R^2 \int_0^{T_3} r_3(t)\,dt + \cdots = \sum_{i=0}^{\infty} R^i \int_0^{T_{i+1}} r_{i+1}(t)\,dt.$$   (5)
Since the expected operational time $E(t)$ calculated by equation (5) differs as the system reliability function $r_i(t)$ changes, there is no unified formulation for $E(t)$ as there is for $E(c)$. However, it can be proved that the sum of the infinite series in equation (5) converges for the deteriorating system, and an approximation for $E(t)$ can be calculated by a numerical algorithm for the given $r_i(t)$.
After obtaining $E(c)$ and $E(t)$, the expected cost per unit time $\varphi(R, N)$ for the system in the life cycle can be calculated by

$$\varphi(R, N) = \varphi(R) = \frac{E(c)}{E(t)}.$$   (6)

For the given $r_i(t)$, both $\varphi(R)$ and $T_i$ are functions of $R$. By minimizing $\varphi(R)$, the optimal system reliability threshold for scheduled preventive maintenance can be determined. At the same time, the values of $T_i$ can be calculated from equation (3), which should be helpful for preparing the scheduled preventive maintenance activities.
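As a cross-check on this formulation, the following sketch evaluates $\varphi(R)$ by solving equation (3) for each $T_i$ with a root finder and truncating the series (5). It assumes the reconstructed improvement factors $a_i = i/(3i+5)$ and $b_i = (12i+3)/(10i+3)$ and the cost parameters of section 3; it is an illustration, not the authors' code.

```python
# A minimal sketch of the situation-1 computation, assuming the reconstructed
# improvement factors a_i = i/(3i+5) and b_i = (12i+3)/(10i+3) and the cost
# parameters of section 3.
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq

delta, alpha = 4.0, 120.0                 # Weibull shape and scale
c_p, c_r, c_f = 100.0, 1000.0, 1200.0     # PM, replacement and failure costs
a = lambda i: i / (3 * i + 5)
b = lambda i: (12 * i + 3) / (10 * i + 3)

def shift_scale(h_i, a_i, b_i, T_i):
    """Hybrid recursion, equation (1): h_{i+1}(t) = b_i * h_i(t + a_i * T_i)."""
    return lambda t: b_i * h_i(t + a_i * T_i)

def phi(R, n_cycles=50):
    """Cost rate phi(R) of equation (6); the series (5) is truncated, which is
    safe because its terms carry the geometric weight R**i."""
    h = lambda t: (delta / alpha) * (t / alpha) ** (delta - 1)   # h_1
    E_t, weight = 0.0, 1.0
    for i in range(1, n_cycles + 1):
        # equation (3): T_i makes the cumulative hazard of cycle i equal -ln R
        T_i = brentq(lambda T: quad(h, 0.0, T)[0] + np.log(R), 1e-9, 10 * alpha)
        r = lambda t: np.exp(-quad(h, 0.0, t)[0])                # r_i(t)
        E_t += weight * quad(r, 0.0, T_i)[0]                     # term of (5)
        weight *= R
        h = shift_scale(h, a(i), b(i), T_i)                      # next hazard
    E_c = (c_r + c_f) + c_p * R / (1.0 - R)                      # equation (4)
    return E_c / E_t

print(phi(0.79))    # compare with the optimum reported in Table 1
```

Scanning $R$ over $[0.60, 1.00)$ in steps of 0.01, as in section 3.1, then locates the minimising threshold.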
2.3. Model formulation for situation 2
In situation 2, preventive maintenance is scheduled for $N$ cycles. During the first $N - 1$ cycles, imperfect preventive maintenance or imperfect corrective maintenance is performed whenever the system reliability reaches the threshold $R$ or whenever the system suffers a failure before the scheduled preventive maintenance, and a replacement action is taken at cycle $N$, whether the system reliability reaches the threshold $R$ or the system fails. So there are two variables, the optimal cycle number $N^*$ and the reliability threshold $R$, to be chosen to minimize the expected cost per unit time.
Based on this maintenance scheduling strategy, the expected total cost $E(c)$ for the deteriorating system in the life cycle is

$$E(c) = \sum_{i=1}^{N-1} \left( c_p + c_f (1 - R) \right) + \left( c_r + c_f (1 - R) \right),$$   (7)
and the expected operational time $E(t)$ for the deteriorating system in the life cycle is

$$E(t) = \sum_{i=1}^{N} \int_0^{T_i} r_i(t)\,dt.$$   (8)
After calculating $E(c)$ and $E(t)$, the expected cost per unit time $\varphi(R, N)$ for the system in the life cycle can be determined by

$$\varphi(R, N) = \frac{E(c)}{E(t)}.$$   (9)
In this situation, $T_i$ can be obtained from equation (3) as in situation 1, and it is a function of $R$. By minimizing $\varphi(R, N)$, which is a function of $R$ and $N$, the optimal system reliability threshold $R$ for scheduled preventive maintenance and the optimal maintenance cycle number $N^*$ can be determined. But searching for the best $R$ and $N^*$ through numerical computation is still complicated and time-consuming.
To simplify this, for a given cycle number $N$, the first derivative of $\varphi(R, N)$ with respect to $R$ is

$$\frac{d\varphi(R, N)}{dR} = -\frac{E(t) \sum_{i=1}^{N} c_f + E(c) \sum_{i=1}^{N} \frac{d}{dR} \int_0^{T_i} r_i(t)\,dt}{\left( E(t) \right)^2},$$   (10)

where

$$\frac{d}{dR} \int_0^{T_i} r_i(t)\,dt = a_{i-1} \int_0^{T_i} r_i(t) \left( h_i(t) - h_i(0) \right) \frac{dT_{i-1}}{dR}\,dt - \frac{1}{h_i(T_i)}.$$   (11)

Letting $\frac{d\varphi(R, N)}{dR} = 0$, from equation (10) there will be

$$N c_f + \varepsilon \sum_{i=1}^{N} \left( a_{i-1} \int_0^{T_i} r_i(t) \left( h_i(t) - h_i(0) \right) \frac{dT_{i-1}}{dR}\,dt - \frac{1}{h_i(T_i)} \right) = 0,$$   (12)

where $\varepsilon = \varphi(R, N)$, which is a function of $R$ for a given $N$.
In order to make finding the optimal reliability threshold $R$ and the maintenance cycle number $N^*$ less complicated and quicker, the following steps are recommended (a sketch of this procedure is given after this list):
1. Set the search range $N \in [1, m]$ and let $n = 1$.
2. For the given $N = n$, solve equation (12) for $R$ by a numerical method such as Newton-Raphson iteration. The $R$ that satisfies (12) is the optimal reliability threshold which minimizes $\varphi(R, N)$ for the given $N$.
3. If $n < m$, let $n = n + 1$ and go back to step 2; otherwise, go to the next step.
4. At this point, the optimal system reliability threshold $R$ has been found for each given $N$. The minimum $\varphi(R, N)$ and the corresponding $R$ and $N$ are what we seek.
After obtaining the optimal system reliability threshold $R$ and the optimal preventive maintenance cycle number $N^*$, preventive maintenance activities can be scheduled to reduce the maintenance cost for the deteriorating system.
3. Numerical examples
In this section, we present some numerical analysis to validate the maintenance decision models discussed above. Generally speaking, maintenance engineers should be responsible for determining the system hazards rate functions and the improvement factors, which can be obtained from their experience or through statistical methods if a large quantity of maintenance data is available. In these numerical examples, a Weibull distribution was used to describe the hazards rate function of the deteriorating system and the improvement factors $(a_i, b_i)$ were assumed to be known. It is assumed that $h_1(t) = (\delta/\alpha)(t/\alpha)^{\delta-1}$, where the shape parameter $\delta = 4$ and the scale parameter $\alpha = 120$. Also, let $a_i = i/(3i+5)$, $b_i = (12i+3)/(10i+3)$, $c_p = 100$, $c_r = 1000$ and $c_f = 1200$. The results for the two different situations are presented below.
3.1. Results for situation 1
In situation 1, the expected total cost $E(c)$ for the system in the life cycle can be obtained by equation (4), and the approximation of the expected operational time $E(t)$ for the system in the life cycle can be estimated by equation (5) by setting a large maintenance cycle number $N$. After obtaining $E(c)$ and $E(t)$, the expected cost per unit time $\varphi(R)$ for the deteriorating system in the life cycle can be calculated by equation (6). Here, the search range is set as $R \in [0.60, 1.00)$ (step = 0.01) and $N = 200$. The results are shown in Table 1 and Figure 2.
R      φ(R)      R      φ(R)      R      φ(R)      R      φ(R)      R      φ(R)
0.60   13.430    0.68   12.800    0.76   12.398    0.84   12.560    0.92   15.053
0.61   13.344    0.69   12.733    0.77   12.375    0.85   12.666    0.93   15.921
0.62   13.259    0.70   12.671    0.78   12.361    0.86   12.804    0.94   17.158
0.63   13.157    0.71   12.612    0.79   12.358    0.87   12.983    0.95   18.972
0.64   13.096    0.72   12.558    0.80   12.366    0.88   13.212    0.96   21.822
0.65   13.018    0.73   12.509    0.81   12.388    0.89   13.507    0.97   26.824
0.66   12.942    0.74   12.466    0.82   12.425    0.90   13.887    0.98   37.524
0.67   12.869    0.75   12.428    0.83   12.482    0.91   14.381    0.99   73.466

Table 1. Results for situation 1
From Table 1, it can be seen that the lowest cost per unit time for the deteriorating system in the life cycle is 12.358 and that the optimal system reliability threshold $R$ for preventive maintenance is 0.79. The first five scheduled preventive maintenance actions for this reliability threshold can be performed at times 83.614, 70.23, 54.441, 40.335 and 29.397. From Figure 2, it can also be seen that as the system reliability increases, the cost per unit time for the deteriorating system in the life cycle decreases at first and then increases rapidly after it passes the lowest point. This conclusion is true for all deteriorating systems, even when they have different hazards rate functions and different maintenance costs.
Other numerical results were also obtained by varying $c_p$, $c_r$ and $c_f$ respectively, while the other parameters were fixed. It can be concluded that: (1) The optimal reliability threshold $R$ decreases gradually with increasing preventive maintenance cost $c_p$. This means that the more expensive the preventive maintenance, the less frequently the scheduled preventive maintenance is performed.
Figure 2. Relationship between ϕ ( R ) and R
(2) With an increase in the replacement cost $c_r$ or the system's failure cost $c_f$, the optimal system reliability threshold $R$ exhibits a gradual increase. This implies that scheduled preventive maintenance actions should be performed more frequently as replacement or system failure becomes more costly.
3.2. Results for situation 2
For this situation, the steps to find the optimal reliability threshold $R$ and maintenance cycle number $N^*$ provided in section 2.3 are applied as follows. Firstly, let $m = 20$, so that the search range is $N \in [1, 20]$; secondly, for each given cycle number $n$, find the optimal reliability threshold $R_n$ that minimizes the expected cost per unit time $\varphi(R, n)$; lastly, when the reliability thresholds $R_n$ for all the cycle numbers have been found, select the lowest $\varphi(R, N^*)$ among the $\varphi(R_n, n)$; the corresponding $R_n$ and $n$ are the optimal reliability threshold $R$ and the optimal cycle number $N^*$. The results obtained are listed in Table 2 and depicted in Figure 3.
Cycle   Reliability   Cost per    Time intervals for scheduled maintenance
number  threshold     unit time
1       0.75162       15.624      87.718
2       0.85793       10.769      75.080, 63.061
3       0.89566       9.3798      69.138, 58.071, 45.015
4       0.91490       8.9603      65.532, 55.042, 42.668, 31.612
5       0.92650       8.9529      63.076, 52.979, 41.068, 30.427, 22.176
6       0.93424       9.1550      61.283, 51.473, 39.901, 29.562, 21.546, 15.788
7       0.93977       9.4757      59.910, 50.320, 39.007, 28.900, 21.063, 15.434, 11.494
8       0.94390       9.8693      58.821, 49.405, 38.298, 28.375, 20.680, 15.154, 11.286, 8.5634
9       0.94712       10.311      57.934, 48.661, 37.721, 27.947, 20.369, 14.925, 11.115, 8.4344, 6.5113
10      0.94969       10.786      57.197, 48.041, 37.241, 27.591, 20.110, 14.735, 10.974, 8.3270, 6.4284, 5.0357

Table 2. Results for situation 2
Figure 3. Relationships between N , R and ϕ ( R, N )
From Table 2, it can be seen that the lowest cost per unit time in the system's life cycle is 8.9529 and that the optimal system reliability threshold for preventive maintenance is 0.92650. A replacement is performed on the deteriorating system at the 5th preventive maintenance cycle. The corresponding time intervals for the preventive maintenance are provided in the table, and it can be clearly seen that the time intervals for the scheduled preventive maintenance decrease as the cycle number increases, which also indicates that the preventive maintenance is imperfect and that the system is subject to a degradation process.
From Figure 3, it can be seen that as the preventive maintenance cycle number $n$ increases, the optimal reliability threshold $R_n$ for the given $n$ increases too, while the expected cost per unit time $\varphi(R, n)$ for the deteriorating system decreases sharply at first and then increases slowly after it reaches the lowest cost. Generally, this is true for all deteriorating systems under imperfect preventive maintenance in a scheduled finite number of maintenance cycles.
4. Conclusions
Considering imperfect preventive maintenance, the hybrid hazards rate recursion model is applied to investigate the restoration effect after performing preventive maintenance on the deteriorating system. Two different situations that are often met in practice have been considered and the corresponding models developed to minimize the expected cost per unit time for the system in the life cycle. Whenever the system reliability reaches the threshold $R$, which is deduced by minimizing the expected cost per unit time, a scheduled imperfect maintenance action is performed on the system. For the second situation, the optimal cycle number $N^*$ as well as the reliability threshold $R$ can also be determined by the cost optimization rules.
Through the numerical examples, it can be seen that the models described in this paper can be applied in practice to schedule the maintenance work. The results and discussions provided with the numerical examples exhibit how the optimal schedule depends on the different cost parameters in the different situations. With these results and discussions, maintenance personnel can schedule their maintenance work for a deteriorating system under imperfect maintenance, and it becomes possible for enterprises to perform preventive maintenance actions with near-zero inventory of spare parts.
In order to apply the imperfect maintenance models in practice more effectively and efficiently, further research, especially on deciding the improvement factors $(a_i, b_i)$ in this hybrid hazards rate recursion model, will be needed to perfect these models.
References
Zhou, X., Xi, L. and Lee, J. (2006) Reliability-centered Predictive Maintenance Scheduling for a Continuously Monitored System
Subject to Degradation, Reliability Engineering and System Safety, 1
Pham, H. and Wang, H. (1996) Imperfect Maintenance, European J. of Operational Research, 94, 425-438
Wang, H. and Pham, H. (1996) Optimal Age-dependent Preventive Maintenance Policies with Imperfect Maintenance, International J. of
Reliability, Quality and Safety Engineering, 3 (2), 119-135
Malik, M. A. K. (1979) Reliable Preventive Maintenance Policy, AIIE Trans., 11 (3), 221-228
Nakagawa, T. (1988) Sequential Imperfect Preventive Maintenance Policies, IEEE Trans Reliab., 37 (3), 295-298
Cheng, C. and Chen, M. (2003) The periodic Preventive Maintenance Policy for Deteriorating systems by Using Improvement Factor
Model, International J. of Applied Science and Engineering, 1 (2), 114-122
Chen, Y. and Cheng, C. (2002) The System Development of Ordinary Periodic Preventive Maintenance Model, Department of Industrial Engineering and Management, Chaoyang University of Technology, Aug. 30, 2002
Kuang, Y., Miao, Q. and Huang, H. (2006) Optimizing Sequential Imperfect Preventive Maintenance for Equipment in Life Cycle, Proceedings of the First International Conference on Maintenance Engineering, 242-249
Jayabalan, V. and Chaudhuri, D. (1992) Cost Optimization of Maintenance Scheduling for a System With Assured Reliability, IEEE
Trans Reliab, 41 (1), 21-25
Lin, D., Zuo, M. J. and Yam, R. C. M. (2000) General Sequential Imperfect Preventive Maintenance Models, International J. of Reliability, Quality and Safety Engineering, 7 (3), 253-266
Contribution to modelling of dynamic dependability of complex systems
David Valis
University of Defence, Kounicova 65, 612 00 Brno, Czech Republic
[email protected]
Abstract: The paper deals with the dependability assessment of complex systems. As we investigate situations regarding military applications, dynamic dependability is very important for us. Dependability characteristics of military battle equipment have the same importance for us as the characteristics that serve the performance of the battle mission itself. There is no time on the battlefield to solve unpredicted and unexpected situations caused by unreliability, which might lead to the loss of both equipment and crew. Due to the high level of risk we face on the battlefield, many systems have to be robust enough, or have to be redundant, to succeed.
1. Introduction
As we know, there are a number of characteristics which might be investigated and solved regarding military applications. Some of them are typically related to the performance of the object, while others are related to supporting characteristics. That they are supporting does not mean that they play a second-class role, but they are usually not given as much preference as those related to performance. In our branch of interest we talk about dependability and its attributes. The common and well-known dependability characteristics are often announced and used for various calculations as well as to describe the item itself. We typically know these characteristics from different types of tests performed during the development and testing phase. Such characteristics are related to the so-called inherent dependability – inherent availability. Apart from these specifications, we also need to know the real behaviour in the battlefield – in real deployment while completing a mission. In real deployment we talk about characteristics related to the so-called operational dependability – operational availability. These characteristics are not calculated theoretically; their calculation is based on practical and realistically possible situations. As such, the real picture of technical item behaviour, namely of military battle vehicles, is the most important for us. Several measures join the set of “dynamic dependability” characteristics. To be able to carry out the dynamic dependability analysis, we have to know the boundary conditions and our limitations. Dynamic in these terms means having the information we need just in time. We may, for instance, choose among several possibilities for getting the time-related characteristics of a military battle vehicle. One of the most appropriate seems to be Markov analysis. Beyond the dynamic characteristics, we also need to know the potential risk level in case of an unexpected event occurrence, both during the training phase and during real deployment while completing a mission.
If we talk about dynamic dependability, we take into account those events which have the major impact on a vehicle's function – failures. The only failures we assess are failures from internal causes. We do not count possible failures from external causes – in the case of battle vehicles, those caused by a hit or attack while performing a mission. In the following parts, we deal with all the above mentioned issues.
2. Risk on battle field and its assessment
In our lives we can recognise and we know plenty of circumstances which may generate the existence of a risk.
As we talk about a risk, we subconsciously feel something wrong, negative, and unpleasant. We feel
endangered or possibly a hazard. The more we know about risk, the harder we cope with it. In some situations,
we can not do anything else other than get used it. In some cases, we may avoid it, reduce it or ignore it. There
are many ways of observing a risk and ways of handling it. The whole discipline dealing with risk has the name
“Risk management” and its fragments have crucial importance for us. Due to the fact that we are dealing with
military battle vehicles, we have to recognise a bit more than the standard risk spectrum – the risk profile we
usually see for civilian vehicles. As the battle vehicles have to perform their mission in very difficult
environments under very adverse conditions, the spectrum of possible impacts is very high. We talk about
sources of risk. A battle vehicle has the potential to be in collaboration with more than one source of risk, it does
not really matter if the vehicle carries out training or if it is in real deployment. Of course, the real deployment
may bring more consequences in the case of an event occurrence. A failure in training does not necessarily need
to be as crucial as one in the case of a real mission. A failure occurrence either in training or in a real mission
puts the vehicle into an involuntary situation which is raised due to the military tasks it has to fulfil. Due to the
very high possibility of being immediately attacked in the battle, the risk that arises is also very high.
Considering the above, we use following description of risk for further work.
Let there exist a certain source of risk, either tangible (environment, object, human being) or intangible
(activity). This source can have both positive, but as in our case also negative impact to its surroundings (other
tangible or intangible elements). The existence of this impact is not always so important. The existence of such a risk (i.e. a negative impact) becomes important only when its impact or importance results from an interaction between an element (individual, group, technical object or activity) and a source (environment or activity).
At this moment it is necessary to realize that risk as such does not exist if there is no interaction between the source of risk and the object (element) that has a certain relationship to this source. It is necessary to take into account the fact that the interaction can have various forms. It may be, for example, a voluntary, involuntary, random or intentional interaction. The effect of these impacts can be attributed especially to the environment in which the object occurs during its existence. Any such impacts shall generally be called the area of risk.
An important and integral part of all analyses will be the precise, qualitative and sufficient identification of the source of risk itself. Without this, we can hardly deal with a risk in a qualified way. Considering these facts, we may understand that risk can be assessed both qualitatively and quantitatively (and of course in both ways at once). The basic expressions which put risk into a commonly understandable form, and which enable further dealing with risk, are as follows. The first and very well known description is in the form of an equation which may serve both for qualitative and quantitative assessment:
may serve both for qualitative and quantitative assessment:
$$R = P \times C$$   (1)
where:
R – Risk;
P – Probability;
C – Consequences.
This expression allows us to carry out both qualitative and quantitative assessments. The problem is that we do not have any numerical expressions with physical units.
The second very well known form of risk expression is the following formula:

$$R = \frac{P \times C}{M} \times E$$   (2)

where:
R – Risk;
P – Probability;
C – Consequences;
M – Measures;
E – Exposition.
This expression also allows us to carry out both qualitative and quantitative assessments. A very big advantage is that we may have physical units related to risk for further analysis.
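As a toy worked example of the two expressions, with purely illustrative numbers and units:

```python
# A toy worked example of equations (1) and (2); all values are illustrative
# assumptions, chosen only to show how units can enter through E and M.
p = 0.05       # probability of the undesired event per mission (assumed)
c = 4.0e5      # consequences of one event, e.g. monetary loss (assumed)
m = 2.0        # dimensionless effectiveness of counter-measures (assumed)
e = 10.0       # exposition, e.g. missions per year (assumed)

risk_1 = p * c             # equation (1): expected loss per mission
risk_2 = (p * c / m) * e   # equation (2): expected loss per year, reduced by M
print(risk_1, risk_2)
```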
For every element of the equations mentioned above, there are more or less clear procedures for their determination. We have to understand that risk assessment, as part of risk management, is subdivided in two possible ways. In terms of finding the solution, we talk either about a “logic (sometimes deterministic) approach” or a “probabilistic approach”. In the case of probability, the situation is more than clear, although in English-speaking countries we have to distinguish between the terms “probability” and “likelihood”; the determination is clear enough. In the case of exposition, we do not have to discuss much the possibility of determining its unit and function. We may expect problems in the determination of measures or consequences. Such decisions are more or less based on expert judgement. This way is not necessarily bad, but it does not give us the possibility to validate or verify a statement made.
From this point of view, as well as from our own historical experience, we recommend using new progressive forms and procedures for measure and consequence determination. As we very often work with linguistic and qualitative measures, which are consequently somehow connected to scales (numerical expressions of qualitative expressions), we would like to be sure that our decision was not bad and that, in the same circumstances under the same conditions, one day later it would be made in the same way. The theory of fuzzy probability and fuzzy logic seems to suit this purpose very well. For more details on how to solve such an issue, see Vališ (2006(b)).
3. Counting distribution of an observed variable and dependability
Based on the part describing risk assessment above, we have been looking for expressions of object behaviour. Such behaviour will give us an appropriate picture of the real condition of the object and will allow us to prepare possible missions with such an object. From a mathematical point of view, we may distinguish between two ways of observing object behaviour, based on the measures and characteristics used. In this part, we would like to describe a potential way of assessing the dependability of a complex technical system which is represented by a counting value, that is, an observed variable. We know the basic characteristics and measures related to the object. Also in this case – solving the issue related to the counting variable – we use Markov analysis to obtain several characteristics of dynamic dependability. For our purpose and description we have chosen an automatic cannon which fires rounds. If a failure of a round occurs, the partial restoration system allows it to replace the faulty round with a new one; we talk about a partial repair.
The system may basically stay in two states, as described below using scenarios.
The mission is completed. In the first case, all the ammunition in a belt of a certain size is used up, either with or without a round failure occurring. If a round failure occurs, a backup system of pyrotechnic cartridges is able to return the system to an operational state. Firing may be single shots, successive short bursts with breaks between them, or one continuous burst. Shooting is either failure free or there are n round failures. When a round failure occurs, the system which restores the function using pyrotechnic cartridges is initiated.

There are two scenarios here too: the system restoring the function via pyrotechnic cartridges is failure free, or a pyrotechnic cartridge fails.

If the pyrotechnic cartridge function is applied, it can remove a failure m times, so the number of restorations of the function equals the number of available pyrotechnic cartridges. To complete the mission successfully, the number of pyrotechnic cartridges m must exceed, or in the worst case equal, the number of failures.

Another alternative is that a round fails and the pyrotechnic cartridge fails too; a different pyrotechnic cartridge is then initiated and restores the function. This requires the number of all round failures n to be lower than, or at most equal to, the number of operational (undamaged) pyrotechnic cartridges m.

The mission is completed in all the cases mentioned above, provided the required level of readiness of block A is maintained.

The mission is not completed. In the second case, shooting is carried out one round at a time, in short bursts, or in one burst, and during the shooting there are n round failures. When a failure occurs, the backup system for restoring the function is initiated. Unlike the previous situation, there are pyrotechnic cartridge failures, the total number of which is at least the number of round failures and at most the number of implemented pyrotechnic cartridges M. In this case the restoration of the function may not take place, and the mission is then not completed because there are not enough implemented pyrotechnic cartridges.
Figure 1. Description of transitions among the states. The diagram shows two alternatives: (1) the function when the mission is completed and (2) the function when the mission is not completed, with transitions among the states 0, m1, m2, …, mm and 1 labelled by the probabilities P(B), P(C), 1 − P(B) and 1 − P(C).
Characteristics of the states:

0 state: The initial state of the object until a round failure occurs, with round probability function P(B). It is also the state the object returns to, with pyrotechnic cartridge probability P(C), when a round failure occurs, $P(\overline{B}) = 1 - P(B)$, or $P(C \mid \overline{B}) = P(C \cap \overline{B}) / P(\overline{B})$.

m1…mm state: A state the object can enter while completing the mission. Either a round failure occurs with probability $P(\overline{B}) = 1 - P(B)$, or there is a pyrotechnic cartridge failure with probability $P(\overline{C}) = 1 - P(C)$.

1 state: A state the object can enter while completing the mission, the so-called absorption state. Transition to this state occurs with probability $P(\overline{C}) = 1 - P(C)$ of a failure of the last pyrotechnic cartridge, as long as the object was in a state "kn" before, or with the probability of a round failure occurrence $P(\overline{B}) = 1 - P(B)$, as long as the object was in state 0 before and all pyrotechnic cartridges have been excluded from use.
Transitions among the states, as well as the absolute probabilities, can be written as:

$P(0) = P(B) + P(\overline{C}_{k_1,0}) + P(\overline{C}_{k_2,0}) + P(\overline{C}_{k_3,0}) + \dots + P(\overline{C}_{k_n,0})$ (3)

$P(m_1) = 1 - P(B)$ (4)

$P(m_m) = (1 - P(B)) + (1 - P(C))$ (5)

$P(1) = 1$ (6)

Transition probabilities are described using the matrix of transition probabilities

$P = \begin{pmatrix} p_{00} & p_{01} \\ p_{10} & p_{11} \end{pmatrix}$ (7)
The arrows in Figure 1 indicate the transitions that occur with positive probability. If we know the transition probability matrix P and the initial distribution of the variable $p_i(0)$, then we can express the absolute probability of the random variable $p_i(n)$ as

$p_i(n) = \sum_{k \in I} p_k(0)\, p_{ki}(n), \quad i \in I$ (8)

This formula can also be expressed in matrix form as

$P(n) = P(0)\, P^n$ (9)

We may describe the behaviour of the item in the stationary state in terms of the limit probabilities $p_j$ defined as

$p_j = \lim_{n \to \infty} p_{ij}(n), \quad j \in I$ (10)

The importance of the limit probabilities lies in expressing the weakening of the initial conditions. With their help we get quite an exact picture of the behaviour of the observed item: once the influence of the initial conditions has faded, we know the probability with which the item stays in each state. Alternatively, we may use the absolute probabilities to determine in which state the item will be after a specific number of measured units. This gives us a dynamic (in time) picture of the observed object.
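As an illustration of formulas (8)-(10), the following sketch propagates an initial distribution through a transition matrix; the 2×2 matrix with an absorbing state mirrors the structure of Figure 1, but its numbers are assumed for the example only.

```python
import numpy as np

# Two states: 0 (operational) and 1 (absorbing), echoing Figure 1;
# the probability values themselves are illustrative assumptions.
P = np.array([[0.9, 0.1],
              [0.0, 1.0]])
p0 = np.array([1.0, 0.0])      # initial distribution p(0)

def absolute_probability(p0, P, n):
    """Formula (9): p(n) = p(0) P^n."""
    return p0 @ np.linalg.matrix_power(P, n)

for n in (1, 10, 100):
    print(n, absolute_probability(p0, P, n))
# For large n the result approaches the limit probabilities of (10),
# here (0, 1) because state 1 is absorbing.
```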
4. Continuous distribution of observed variable and dependability
As we described the counting variable related to the observed item above, we may also use a continuous variable to get a picture of the object's behaviour. We are looking for a random function X(t), where X(t) takes values from the set I = {0, 1, 2}. We call the items of the set I the states of the observed process. If the parameter involved is time (for instance t ∈ ⟨0, ∞)), then we call the random function X(t) a Markov chain with continuous parameter. We call such a chain homogeneous if the following holds:

$p_{ij}(s,t) = p_{ij}(0, t-s) = p_{ij}(t-s), \quad s < t$ (11)

It is clear from this formula that the transition probabilities among the states depend only on the difference of the arguments t − s, and not on the arguments t and s themselves. Such a model is valid for those items and systems which are not capable of performing any operation, even in a reduced mode, when a failure occurs: from a state point of view, they transfer immediately from state "0" (operating state) to state "1" (disabled state). This form is the most frequently used; for items or systems with partial performance capabilities it is extended with at least one intermediate state. Items or systems behaving in this way are not very suitable for us, due to the potential danger of a complete inability to perform any function in the case of failure. The transitions among states may be described using either probabilities or rates (as displayed below). Any transition among states may occur and the model has the following form:
As in the previous part with the counting parameter, we use the same description of the states: "0" means that the item/system is in the operational state and "1" means that it is in the disabled state. Such a description may be applied to various complete systems (e.g. vehicles, weapon systems) or subsystems (e.g. engines) within military equipment. We are also able to create plenty of different scenarios for each state description.

The transition rates have the following form. For i = 0 and j = 1:

$q_{ij} = \frac{1}{MTBF} = \frac{1}{E_P(X)}$ (12)

where $E_P(X)$ (MTBF – Mean Time Between Failures) is the mean value of the time to failure, i ∈ {0; 1}, j ∈ {0; 1} and j ≠ i. For i = 1 and j = 0:

$q_{ji} = \frac{1}{MTTR} = \frac{1}{E_O(X)}$ (13)

where $E_O(X)$ (MTTR – Mean Time To Repair) is the mean value of the time to repair.
We now summarise the relevant part of the mathematical notation. The following holds for a Markov chain with continuous parameter. We define the transition rate as follows. Let h denote an increment of the argument t; then

$q_{ij} = \lim_{h \to 0^+} \frac{p_{ij}(h)}{h}, \quad i \neq j$ (14)

where $p_{ij}(h)$ denotes the transition probability from state i to state j during an interval of length h, and we call the value $q_{ij}$ the transition rate from state i to state j. From formula (14) the following is also valid:

$p_{ij}(h) \approx q_{ij}\, h$ (15)

If $p_{ii}(h)$ denotes the probability of remaining in state i during a time interval of length h, then we define the value $q_i$,

$q_i = \lim_{h \to 0^+} \frac{1 - p_{ii}(h)}{h}$ (16)

as the transition rate out of state i, with $q_i = -q_{ii}$. From formula (16), the following also holds:

$p_{ii}(h) \approx 1 - q_i\, h$ (17)

The values $q_i$ and $q_{ij}$ fulfil the condition

$q_i = \sum_{j \in I,\, j \neq i} q_{ij}, \quad \text{for all } i \in I$ (18)

where I = {0; 1; 2; …} is the set of states considered.
We also introduce the equations for calculating the transition probabilities:

$p'_{ij}(t) = \sum_{k \in I} p_{ik}(t)\, q_{kj}, \quad i, j \in I$ (19)

and the equation system for calculating the absolute probabilities:

$p'_i(t) = \sum_{k \in I} p_k(t)\, q_{ki}, \quad i \in I$ (20)

For the exact solution of these differential equations it is necessary to know the particular transition rates among the states. The equations are used to give exact information about the system, especially about how much time the system will spend in a particular state.
For calculating the individual measures (such as the transition rates), we consider the theory of "inherent availability of a complex system composed of many mutually independent components" suitable. The solutions of these differential equations give the transition probabilities as well as the absolute probabilities, expressing how much time the system spends in which state. Such information relates directly to the dynamic dependability measures; without it, our decision making would be much harder. That is why we appreciate such procedures for indicating dynamic dependability, especially for military vehicles.
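A minimal numerical sketch of the two-state model follows: it builds the generator from the rates (12)-(13) and integrates the absolute-probability equations (20). The MTBF and MTTR values are assumptions for the demonstration.

```python
import numpy as np
from scipy.integrate import solve_ivp

MTBF, MTTR = 100.0, 5.0            # assumed illustrative values
lam, mu = 1.0 / MTBF, 1.0 / MTTR   # q01 from (12), q10 from (13)

# Generator matrix; each row satisfies condition (18), q_i = sum of q_ij.
Q = np.array([[-lam, lam],
              [mu, -mu]])

def kolmogorov(t, p):
    """Absolute-probability equations (20): p'(t) = p(t) Q."""
    return p @ Q

sol = solve_ivp(kolmogorov, (0.0, 500.0), [1.0, 0.0], dense_output=True)
print(sol.sol(500.0)[0])        # probability of the operating state "0"
print(mu / (lam + mu))          # stationary availability for comparison
```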
5. Method applicability example
The applicability of the methods has been proven on two main military projects carried out in the Czech Armed Forces in recent years. The first was the main battle tank modernisation project. During the preparation phase, several analyses of the impact of the modernised items should have been carried out. Although the power pack and other technical measures were evaluated well, the other analyses (among them dependability) were not carried out properly. While testing the tactical and technical capabilities of the newly modernised version of the tank, many unexpected failures occurred. Some of these failures might have led to the loss of the vehicle, either through immobility or through the loss of other key combat functions, both on the battlefield and in training. The result was to halt the acceptance and operation of the test version until the causes of the failures could be found and removed. Using the method presented here (for the continuous variable), the dependability measures were improved and efficiency increased.
The second application of the method is a new project being conducted for the air force, concerning new two-barrel automatic cannons for subsonic fighters. The cannon is a very complex mechatronic system with two key parts which behave as counting variables – the ammunition and the pyrotechnic cartridges. As the project is still in the test operation phase, the method has served to reduce the frequency of failures occurring during the observed periods and to implement design changes for dependability improvement. Again, the application of the method has proved successful.

Both examples involve very complex procedures, whose detailed description is much wider than this contribution allows. For further information or personal interest, please contact the author of this contribution.
6. Conclusions
This paper describes procedures suitable for dynamic dependability assessment. We have been looking for new and progressive methods which give a more precise view of military (battle) equipment: the more information we have about such equipment, the more successful its deployment can be.

One of the things we have to take into account, and must not ignore, is risk; it is very high both in training and in real deployment. The first part of the paper deals with the basic understanding of risk and elementary formulas for its expression. The following parts present dynamic dependability assessment and investigation for both the counting and the continuous case; when using each procedure, we need to respect its particular conditions.

Both procedures have been proven within the Czech Armed Forces on the respective equipment. These examples have confirmed the ability of the mathematical procedures to express system behaviour in terms of dynamic dependability. The results corresponded with reality as well as with our expectations.
Acknowledgement
This contribution has been made with support of “Research Purpose Fund” of Faculty of Military Technology
Nr. 0000 401, University of Defence in Brno.
References
Kropáč, J. (2002) Vybrané partie z náhodných procesů a matematické statistiky, Brno: VA v Brně, Skripta S-1971
Kropáč, J. (1987) Základy náhodných funkcí a teorie hromadné obsluhy, Brno: VAAZ, Skripta S-1751/A
Vališ, D. (2003) Analysis of vetronics application consequences onto military battle vehicles dependability, Brno: VA v Brně,
Dissertation thesis
Vališ, D. (2005) Fundamentals of description, perception and value determination of Risk, Liberec: Technical University 2005
Vališ, D. (2006(a)) Contribution to Reliability and Safety Assessment of Systems, Sborník příspěvků konference – Opotřebení
Spolehlivost Diagnostika 2006, Brno: Universita Obrany, 31. říjen – 1. listopad 2006, str. 329 - 337, ISBN 80-7231-165-4
Vališ, D. (2006(b)) Assessment of Dependability of Mechatronics in Military Vehicles, Sborník příspěvků konference – Opotřebení
Spolehlivost Diagnostika 2006, Brno: Universita Obrany, 31. říjen – 1. listopad 2006, s. 309 - 319, ISBN 80-7231-165-4
Vališ, D. and Vintr, Z. (2006) Dependability of Mechatronics Systems in Military Vehicle Design, Proceedings of the European Safety and Reliability Conference "ESREL 2006" (September 18-22, 2006, Estoril, Portugal), London/Leiden/New York/Philadelphia/Singapore: Taylor & Francis, 1703-1707, ISBN 10 0 415 41620 5
Weak sound pulse extraction in pipe leak inspection using stochastic resonance
Yang Dingxin, Xu Yongcheng
The Institute of Mechatronics Engineering, National University of Defense Technology, Changsha, P.R.China
Zip Code: 410073
[email protected]
Abstract: Acoustic methods are widely used in industrial pipe leak inspection. A series of sound waves is transmitted through the pipe. When there is a leak at some position in the pipe, a sound pulse containing the position information of the leak emerges in the received sound wave signal. Due to environmental noise or the small size of the leak, the amplitude of the sound pulse is usually very weak, and it is important to extract the weak sound pulse from the noisy signal. Stochastic resonance (SR) is a nonlinear phenomenon occurring in particular nonlinear systems subjected to noise and other excitation signals, and is reported to be useful in enhancing digital signal transmission and detecting weak periodic signals. In this paper, the SR model is applied to the extraction of the weak sound pulse from the received sound signal. We detail the principle and algorithm of the SR model for extracting weak aperiodic pulses. A simulation scheme and results are presented. Actual leak inspection experiments on a PVC pipe with small leaks were carried out using the SR model to extract the weak sound pulse. The results show that SR provides a novel and effective way of extracting weak sound pulses in pipe leak inspection.
1. Introduction
Pipelines are widely used in the metallurgy, petroleum, water supply and sewage industries. Cracks and leaks often occur in pipes because of extreme environmental conditions. Among acoustic methods, the use of a sound pulse for leak inspection is known to be a novel one, with the advantages of fast inspection speed, high efficiency and no constraint on the pipe material.

When a series of sound waves is transmitted through the pipe and there is a leak at some position, the reflected sound wave containing the position information of the leak travels back to the transmission end, and a sound pulse representing the leak emerges in the received sound wave signal. As the speed of sound in a given pipe is constant, we can work out the position of the leak in the pipe by processing the received sound pulse signal with a computer. Due to environmental noise or the small size of the leak, the amplitude of the sound pulse is usually very weak; the pulse is often submerged in various kinds of noise and is hard to distinguish. Junming & Zingxin (2001) applied wavelets to process the weak sound pulse. Some weak sound pulse signals were extracted from noisy backgrounds using wavelets, but the wavelet method is sensitive to the selection of the basic wavelet function: different basic wavelet functions can bring forth entirely different results, and in the case of strong noise the effect is not good. Furthermore, as a kind of characteristic signal, weak aperiodic pulses often occur in weak chemical spectrum extraction, singularity detection, etc. As a result, the extraction of weak sound pulses from noisy signals is receiving more and more attention.
SR is a nonlinear phenomenon generally occurring in dynamical bistable systems excited by noisy periodic
signals, which was first introduced by Benzi, Sutera & Vulpiani (1981) to explain the periodic switching of the
ancient Earth’s climate between ice ages and periods of relative warmth. Through SR, the signal-to-noise ratio
(SNR) of weak signals immersed in heavy noise can be greatly improved if the nonlinearity, weak signal and
noise satisfy some conditions. Initially, SR was limited to treating weak periodic signals, and the aperiodic
signals were rarely involved. In recent years, it has been found that SR can also enhance the transmission of
weak aperiodic signals through nonlinear bistable systems. Collins, Chow & Imhoff (1995) introduced
aperiodic SR in research on the response of excitable neurons to weak aperiodic signals. Neiman &
Schimansky-Geier (1994) discussed harmonic noise transmission through bistable systems. The weak chemical spectrum signal (a kind of aperiodic pulse wave submerged in noise) has been extracted by means of aperiodic SR in bistable systems, see Wang et al. (2000) and Pan et al. (2003).
In this paper, we use the SR algorithm of bistable systems to extract the sound pulse in pipe leak inspection. Firstly, a stochastic resonance numerical simulation algorithm is proposed for the detection of weak aperiodic pulse signals. The effect of the parameters of the nonlinear system on the performance of the algorithm is discussed in detail and the optimisation of these parameters is studied. Finally, the SR algorithm is applied to PVC pipe leak inspection: extracting from heavy noise the weak sound pulse representing a small hole in the pipe.
2. Principle and algorithm of SR model for extracting a weak aperiodic pulse
The nonlinear bistable system modulated by a signal and Gaussian white noise has been extensively exploited in the study of SR. It is defined by the nonlinear Langevin equations

$\dot{x} = -U'(x) + K\,p(t)$
$p(t) = u(t) + n(t)$ (1)
$\langle n(t) \rangle = 0; \quad \langle n(t)\, n(t') \rangle = 2D\,\delta(t - t')$

where $U(x) = -\frac{a}{2}x^2 + \frac{b}{4}x^4$ is a double-well potential with positive parameters a and b characterising the system; it has an unstable maximum at x = 0 and two stable minima at $x = \pm\sqrt{a/b}$. The potential barrier height is $\Delta U = a^2/4b$. p(t) denotes the input signal embedded in a noisy background, with weak signal u(t) and noise n(t); n(t) is zero-mean Gaussian white noise with variance $\sigma^2 = 2D$. K is an adjustable parameter used to normalise the input signal. When SR takes place, the input signal, the noise and the nonlinear system cooperate well and the signal extracts energy from the noise: the strength of the signal increases while that of the noise decreases, so the output signal of the system has a better SNR than the input.
When the input signal is a sinusoid, $u(t) = A\sin\omega t$, then according to the adiabatic theory of SR the SNR of the output signal can be derived from Equation (1), see Gammaitoni (1998), as

$\mathrm{SNR} = \frac{\sqrt{2}\, A^2 \Delta U}{D^2}\, e^{-\Delta U / D}$ (2)
It is clear that the SNR of the output signal can be improved by adjusting the potential barrier ΔU and the noise strength D. Since ΔU is determined by the system parameters a and b, the values of a and b have a substantial effect on the detection performance. The same conclusion holds for aperiodic signals. In this work, no external noise is added; only the parameters of the system are adjusted to match the aperiodic pulse signal and the intrinsic noise so as to achieve SR. The input signal is normalised to [−1, 1] by the adjustable factor K, so all samples have the same strength in the nonlinear system. To date, accurate analytical analysis of the aperiodic SR behaviour of bistable systems subjected to aperiodic random signals has been difficult, so numerical simulation is adopted here for detecting weak aperiodic pulse signals. The discrete Langevin equation (1) is solved approximately using the Runge-Kutta method. The algorithm can be described as follows:
$x_{n+1} = x_n + \frac{1}{6}(k_1 + 2k_2 + 2k_3 + k_4), \quad n = 0, 1, \dots, N-1$ (3)

$k_1 = h\,[a x_n - b x_n^3 + K p_n]$
$k_2 = h\,[a(x_n + \tfrac{k_1}{2}) - b(x_n + \tfrac{k_1}{2})^3 + K p_n]$
$k_3 = h\,[a(x_n + \tfrac{k_2}{2}) - b(x_n + \tfrac{k_2}{2})^3 + K p_{n+1}]$ (4)
$k_4 = h\,[a(x_n + k_3) - b(x_n + k_3)^3 + K p_{n+1}]$
where $x_n$ and $p_n$ denote the nth samples of x(t) and p(t) respectively, h denotes the time step (the reciprocal of the sampling frequency $f_s$), and N is the number of sampling points.
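The scheme (3)-(4) translates directly into code. A minimal sketch follows, assuming the system parameters a, b, K and the sampling frequency are chosen by the user as discussed in the text:

```python
import numpy as np

def sr_bistable(p, a=10.0, b=0.01, K=1.0, fs=1000.0):
    """Integrate x' = a*x - b*x**3 + K*p(t) with the scheme (3)-(4).

    p is the (normalised) noisy input sequence; a, b, K and fs are
    assumed inputs, not fixed values from the paper.
    """
    h = 1.0 / fs                     # time step, reciprocal of fs
    x = np.zeros(len(p))
    for n in range(len(p) - 1):
        xn, pn, pn1 = x[n], p[n], p[n + 1]
        k1 = h * (a * xn - b * xn ** 3 + K * pn)
        k2 = h * (a * (xn + k1 / 2) - b * (xn + k1 / 2) ** 3 + K * pn)
        k3 = h * (a * (xn + k2 / 2) - b * (xn + k2 / 2) ** 3 + K * pn1)
        k4 = h * (a * (xn + k3) - b * (xn + k3) ** 3 + K * pn1)
        x[n + 1] = xn + (k1 + 2 * k2 + 2 * k3 + k4) / 6
    return x
```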
A cross-correlation-based measure that considers the correlation between the stimulus signal and the system response is usually used to evaluate the quality of aperiodic SR. It is termed the power norm $C_0$. If the input signal is p(t) and the output is x(t), then $C_0$ is given by

$C_0 = \overline{[p(t) - \overline{p(t)}]\,[x(t) - \overline{x(t)}]}$ (5)

where the overbar denotes an average over time. Based on the power norm, the normalised power norm $C_1$ is given by

$C_1 = \frac{C_0}{\sqrt{\overline{[p(t) - \overline{p(t)}]^2}}\;\sqrt{\overline{[x(t) - \overline{x(t)}]^2}}}$ (6)
From a signal processing perspective, maximising C1 corresponds to maximising the shape matching
between the input stimulus p(t) and the system response x(t). As such, this measure enables one to quantify the
detection performance.
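Equations (5) and (6) reduce to a few array operations; a small sketch, assuming p and x are equal-length NumPy arrays:

```python
import numpy as np

def power_norms(p, x):
    """Power norm C0 of (5) and normalised power norm C1 of (6)."""
    dp, dx = p - p.mean(), x - x.mean()
    c0 = np.mean(dp * dx)                                    # equation (5)
    c1 = c0 / np.sqrt(np.mean(dp ** 2) * np.mean(dx ** 2))   # equation (6)
    return c0, c1
```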
3. Numeric simulation performance of SR model
A numerical simulation method is adopted in this research for detecting a weak aperiodic pulse signal. The effect of the system parameters on the detection performance, and the constraint conditions on the weak aperiodic pulse signal itself, are also investigated. Simulated aperiodic pulse signals were produced according to

$u(t) = h_0 \exp\!\left[\varepsilon \left(\frac{t - t_0}{W}\right)^{2}\right]$ (7)

Three parameters control the waveform of the aperiodic pulse signal: in Equation (7), $h_0$ is the peak height, W denotes the peak half-width and ε controls the attenuation speed of the pulse; t and $t_0$ denote time and the peak position, respectively.

The original aperiodic pulse signal with $h_0$ = 0.8, W = 0.1, $t_0$ = 0.3, ε = −2.7 is shown in Figure 1(a), and the mixed input signal with added Gaussian white noise with σ = 1 is shown in Figure 1(b) as an example, where the sampling frequency is $f_s$ = 1000 Hz and the number of sample points is N = 1000.
Figure 1. Simulated aperiodic pulse signal and the mixed input signal with noise
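For reproducing this test case, the pulse of Equation (7) and its noisy version can be generated as follows (σ = 1 is assumed, as in the figure); the result can then be fed to the sr_bistable and power_norms sketches above.

```python
import numpy as np

fs, N = 1000.0, 1000                          # sampling frequency and length
t = np.arange(N) / fs
h0, W, t0, eps, sigma = 0.8, 0.1, 0.3, -2.7, 1.0
u = h0 * np.exp(eps * ((t - t0) / W) ** 2)    # clean pulse, Figure 1(a)
p = u + sigma * np.random.randn(N)            # noisy input, Figure 1(b)
```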
3.1 The effect of system parameters a and b on the performance of detection
In the numerical simulation, the proposed Runge-Kutta algorithm is used to solve the aperiodic pulse signal detection model. The parameters a and b of the nonlinear system define the height of the potential barrier and the potential's profile; therefore a and b have a substantial effect on the quality of the output signal. The normalised power norm $C_1$ is used to quantify the detection performance.

With the system parameter held constant at a = 10, the normalised power norm $C_1$ obtained from Equation (6) for different values of the system parameter b is shown in Figure 2, where $h_0$ = 0.8, W = 0.1, $t_0$ = 0.3, ε = −2.7, σ = 1, and the x-coordinate is logarithmic.

It can be seen from Figure 2 that the normalised power norm $C_1$ initially decreases with increasing b and reaches a minimum; then $C_1$ increases with increasing b until a maximum is attained, after which it decreases slowly again. The system parameter b varies over a wide range, so the optimisation of b should avoid falling into the valley of the curve. In general, a suitably small b is appropriate for detection. Figure 3 presents the results for $C_1$ with different a for the same dataset used in Figure 2, while b = 0.01 is held constant; the x-coordinate is again logarithmic. It is clear that $C_1$ has a single maximum as a increases. The response speed of system (1) is mainly decided by a, so the optimised a should match the rate of change of the signal to improve the SNR of the output signal. In Figure 3, $C_1$ reaches its maximum when a is about 16.

Figure 2. Normalised power norm C1 against parameter b
Figure 3. Normalised power norm C1 against parameter a

In practice, the optimisation of the system parameters is essential and complicated. Initially, a small value of b should be considered. In addition, more attention should be given to the selection of parameter a: the match with the speed of the signal should be considered, and a is often selected through repeated trials and practical experience.
3.2 The effect of signal parameters h0 and W on the performance of detection
In the following, the effect of the signal parameters $h_0$ and W on the detection performance is studied, and constraint conditions for $h_0$ and W are obtained. Firstly, the parameter $h_0$ is considered. With the system parameters a = 50, b = 0.01, W = 0.1, $t_0$ = 0.3, ε = −2.7, σ = 1 held constant, the normalised power norm $C_1$ obtained from Equation (6) at different peak heights $h_0$ is shown in Figure 4. $C_1$ evidently increases smoothly with increasing peak height $h_0$. In order to distinguish the weak aperiodic pulse signal from the noise, $C_1$ must not be less than a certain value if the output signal is to be of decent quality. According to the numerical experiments, $C_1$ must be greater than 0.5; therefore, from Figure 4, the peak height must not be less than 0.3.

Next, the effect of the signal parameter W on detection performance is considered. With the system parameters a = 50, b = 0.01, $h_0$ = 0.8, $t_0$ = 0.3, ε = −2.7, σ = 1 held constant, the normalised power norm $C_1$ obtained from Equation (6) at different peak half-widths W is shown in Figure 5. The curve of $C_1$ in Figure 5 is similar to that in Figure 4: $C_1$ increases monotonically with increasing peak half-width W. An increase in W enhances the detection performance, while a reduction in W has a substantial influence on the quality of the output signal. According to the numerical experiment, W must not be less than 0.05 for the weak aperiodic pulse signal to stand out from the noise background.
Figure 4. Normalised power norm C1 against peak height h0
Figure 5. Normalised power norm C1 against peak half-width W
3.3 Example of weak aperiodic pulse signal detection
With W = 0.1, $t_0$ = 0.3, ε = −2.7, σ = 1 and $h_0$ = 0.5, 0.8, 1.2 and 1.5 respectively, the mixed input signals are shown in Figure 6. In accordance with the discussion above, the system parameters are taken as a = 10 and b = 0.01. The output signals obtained from Equation (3) are shown in Figure 7.
Figure 6. Simulated mixed input signal
(a) h0=0.5 (b) h0=0.8 (c) h0=1.2 (d) h0=1.5
Figure 7. Obtained output signal
(a) h0=0.5 (b) h0=0.8 (c) h0=1.2 (d) h0=1.5
From Figure 7, it can be seen that the weak aperiodic pulse signal is extracted from the strong noise
background and there is good shape-matching between the output x(t) and the signal p(t).
4. PVC pipe experiment
When a series of sound waves is transmitted through the pipe and there is a leak at some position, a weak sound pulse representing the leak emerges in the received sound wave signal. Because the speed of sound in a given pipe is constant, we can work out the position of the leak along the pipe by processing the received sound pulse signal with a computer. Due to environmental noise or the small size of the leak, the amplitude of the sound pulse is usually very weak. The sound pulse is a kind of aperiodic pulse signal. In order to examine the effectiveness of the above stochastic resonance algorithm, sound pulse leak
inspection experiments were carried out on a PVC pipe. The pipe has a diameter of 25 centimetres and a length of 4 metres. We drilled a very small hole (1.2 mm in diameter) in the pipe and sent a series of sound waves from one end using an EEC-16/XB acoustic leak inspection instrument. Because of the drilled hole, an aperiodic sound pulse emerged in the sampled returned sound wave signal; the position of the sound pulse represents the position of the hole in the PVC pipe.

Figure 8(a) illustrates the received sound wave signal without noise; the x-coordinate is the sample point. The sound pulse peak is located near the 30th sample point, which represents the position of the leak in the pipe. Figure 8(b) illustrates the received sound signal with noise (approximately σ = 1): the sound pulse peak is hard to distinguish and the position of the hole cannot be determined. We input the sampled data from Figure 8(b) into Equation (3), with parameters a = 0.4, b = 0.4 and time step h = 1. The output signal processed by the stochastic resonance algorithm is shown in Figure 9.
Figure 8. Received sound wave signal
Figure 9. Output signal of Figure 8(b) through the SR process
In Figure 9, it is obvious that there is a hop between the two potential wells of the bistable system; thus the weak sound pulse representing the hole position in the pipe rises out of the noise. The peak position is near the 30th sample point, in accordance with the noiseless signal of Figure 8(a).
5. Conclusion
The sound pulse method is suitable for fast leak inspection of metal and non-metal pipes, and is widely applicable to the online leak inspection of heater tubes, condenser tubes, heat exchanger tubes, etc. The aperiodic characteristic signal containing the information about the leak is often submerged in various kinds of noise. The SR method proposed in this paper is a new attempt at solving the problem. The simulation and experimental results show that the SR method can improve the detection of weak aperiodic sound pulses and provides a novel means of weak sound pulse extraction.
References
Benzi, R., Sutera, A. and Vulpiani, A. (1981) The mechanism of stochastic resonance, J. of Physics A: Mathematical and General, 14,
453-457
Collins, J., Chow, C. and Imhoff, T. (1995) Aperiodic stochastic resonance in excitable systems, Physical Review E, 52, R3321-R3324
Gammaitoni, L. (1998) Stochastic resonance, Reviews of Modern Physics, 70, 223-287
Heneghan, C. and Chow, C. C. (1996) Information measures quantifying aperiodic stochastic resonance, Phys. Rev. E, 54, R2228-R2231
Junming, L. and Zingxin, Y. (2001) The application of wavelet analysis in sound pulse processing, Nondestructive Testing, 3, 34-35
Neiman, A. and Schimansky-Geier, L. (1994) Stochastic resonance in bistable systems driven by harmonic noise, Phys. Rev. Lett., 72,
2988-2991
Pan, Z. et al. (2003) A new stochastic resonance algorithm to improve the detection limits for trace analysis, Chemometrics and
Intelligent Laboratory Systems, 1362, 1-9
Wang, L. et al. (2000) A novel method of extracting weak signal, Chemical J. of Chinese Universities, 21, 53-55
An empirical comparison of periodic stock control heuristics for intermittent
demand items
M. Zied Babai, Aris A. Syntetos
Centre for Operational Research and Applied Statistics, Salford Business School, University of Salford,
Maxwell Building, The Crescent, Manchester M5 4WT, UK
[email protected], [email protected]
Abstract: Intermittent demand patterns are common amongst spare parts. Typically, the inventories related to
intermittent demand SKUs are managed through periodic stock control solutions, though the specific policy
selected for application will depend upon the degree of intermittence (slow/fast intermittent demands)
associated with the SKUs under concern. In this research, the performance of some periodic stock control
heuristic solutions that are built upon specific demand distributional assumptions is examined in detail. Those
heuristics have been shown to perform well for differing spare parts demand categories and the investigation
under concern allows insight to be gained on demand classification related issues.
Keywords: Spare parts management, intermittent demand, stock control, empirical analysis
Minimising average long-run cost for systems monitored by the np control chart
Shaomin Wu 1, Wenbin Wang 2
1. Sustainable Systems Department, School of Applied Science, Cranfield University, Cranfield, Bedfordshire,
MK43 0AL, UK
2. Maxwell 603, Salford Business School, University of Salford, 43 The Crescent, Manchester, M5 4WT, UK
[email protected], [email protected]
Abstract: This paper formulates the average long-run cost for a system monitored by an np-control chart. It is
assumed that the system has two types of failures: minor failure and major failure. If a minor failure occurs, the
system can still operate but a signal from the np-control chart indicates that the system is out-of-control. If a
major failure occurs, the system cannot operate. The system with a minor failure can be restored to a better state
by a maintenance activity. In this paper, the optimization of maintenance policies for such a system is presented.
Geometric processes are utilized for modelling life times and repair times. The average cost per time unit for
maintaining the systems is obtained. Numerical examples and sensitivity analysis are used to demonstrate the
applicability of the methodology derived in the paper.
Keywords: Quality control, reliability, control charts, repairable systems, geometric processes
1. Introduction
This paper aims to optimise economic cost for maintaining a manufacturing system by the utilisation of
statistical process control and the renewal process. A system monitored by an np-chart may have three states:
in-control state, out-of-control state and failure state. An in-control state indicates that the system functions
without any problem; an out-of-control state indicates that the system can be disrupted by the occurrence of
events called assignable causes but it still functions; a failure state indicates that the system fails to function. If
repair is carried out when the system is in either the out-of-control or failure state, the system can be brought
back to the in-control state.
Maintenance policy design for such systems has drawn plenty of attention since Girshick & Rubin (1952).
Research in the area always assumed both the times until the occurrence of assignable causes and the times
between failures to be independent identically distributed exponential random variables. The assumptions may
hold for some manufacturing systems which can be repaired as good as new. In some scenarios however, the
assumption may be violated because the system may deteriorate with time and cannot be repaired as good as
new, see Schneeweiss (1996). Models describing these deteriorating systems include non-homogeneous Poisson
process (NHPP) models (Girshick & Rubin (1952)), Brown-Proschan imperfect repair models (Coetzee (1997))
and generalized renewal process models (Brown & Proschan (1983) and Lam (1988a)).
Control charts in statistical process control are normally applied to monitor the system when the assignable
causes are not observable. Samples from the process output are taken at regular time intervals and the quality
characteristic of the sample items is measured. Statistical inference about the state of the process and the
assignable cause in effect, if any, are drawn based upon those measurements. In this paper, the optimization of the maintenance policy for a system monitored with an np-chart is studied. Geometric processes are applied to analyze the expected time and cost, and the average cost per time unit for maintaining the system. We then investigate the situations where the geometric process can be applied to model the system failure process. The paper starts with the assumptions in Section 2; Section 3 analyzes the average cost per time unit for maintaining the system; Section 4 presents data experiments; and the last section draws concluding remarks.
2. Assumptions
Consider a manufacturing system with outputs characterised by a discrete countable process. In order to detect
assignable causes for defective items, an np-control chart is used to monitor the system by taking samples of
size n at fixed intervals of h time units. The number of defective items among the samples is observed. This
provides information revealing possible problems (or assignable causes) with the system. If the number of
defective items is larger than a specific number, one can statistically infer that an assignable cause may exist in
the system (or the system is in the out-of-control state) and the specific number is called an out-of-control alarm.
If the number is smaller than a specific number, one can infer that the system is in the in-control state. The
following assumptions also hold;
• The system can either shift from the in-control state to the out-of-control state and then to the failure state,
or shift from the in-control state to the failure state without going through the out-of-control state. Neither
the failure state nor the out-of-control state can shift to the in-control state without any repair.
• The defective rates of the product are p0 and p1 when the system is in the in-control state and in the out-of-control state respectively, where p1 > p0.
• The chart is used to detect if the system is in-control or out-of-control, but the failure state does not need
detecting. If the np chart indicates that the system is in the out-of-control state, this means an assignable
cause may exist and an investigation will be carried out. During the investigation for the assignable causes,
the system continues running. When the system is confirmed as being in the out-of-control state, a minor
repair will be conducted which can bring the system back to the in-control state. The system is still
operating while it is under repair. Once the system fails, the system stops running and a major repair takes
place. The major repair can bring the system back to the in-control state. Neither repair type necessarily restores the assignable cause or the system to the condition it was in prior to the failure.
• A cycle is defined to be the time between two adjacent starts of the system after major repairs. A cycle may
include many in-control states and out-of-control states, but only one failure state.
For the mth cycle, the following notation applies;
• X1m and X2m are the times from the beginning of the in-control state to the occurrence of an assignable cause
and to the failure, respectively.
• Y1m and Y2m are the times on a minor and major repair, respectively.
• τm0 and τm1 are the times between the shift of the process to the out-of-control state and the first inspection
thereafter without any considerations of the possibility of failure and under the consideration of the
possibility of failure, respectively.
• τs is the investigation time for an assignable cause.
• Tm0 is the time from the first start of the system to the end of the last minor repair (see Figure 1). It may
include operating time, time on inspecting samples, time on investigating causes and time on minor repairs.
• Lm01 is the expected time from a start of the system to the end of its first adjacent minor repair in the mth
cycle. It may include operating time, time on inspecting samples, investigating causes and minor repairs.
• Tm01 is the expected time of the system's operation and being repaired from a start of the system to the end of
its first adjacent minor repair in the mth cycle. It only includes operating and repair time on a minor repair.
• Tm1 is the time from the start of the in-control state to the failure with an occurrence of the assignable cause
within the out-of-control state in the mth cycle (see Figure 1).
• Tm2 is the time from the start of the in-control state to the failure with an occurrence of the assignable cause
within the in-control state in the mth cycle (see Figure 1).
• pio is the probability that the system shifts to the out-of-control state from the in-control state.
• pif is the probability that the system fails within the in-control state.
• cs is the investigation cost for an assignable cause.
• cp is the profit per time unit when the system is in the in-control state.
• co is the profit per time unit when the system is in the out-of-control state.
• cr1 and cr2 are the costs per time unit for minor and major repairs, respectively.
• ci is the inspection cost per item of output.
• M is the number of past cycles.
• $X_{1m}$, $X_{2m}$, $Y_{1m}$ and $Y_{2m}$ are exponentially distributed with means $1/\lambda_1^{m-1}$, $1/\lambda_2^{m-1}$, $1/\mu_1^{m-1}$ and $1/\mu_2^{m-1}$, respectively.

Figure 1 shows an example of an np control chart, where the in-control zone is $Z_0 = (LCL, UCL)$ and the out-of-control zone is $Z_1 = (0, LCL) \cup (UCL, n)$, with $UCL = np + \delta\sqrt{np(1-p)}$ and $LCL = \max(0,\, np - \delta\sqrt{np(1-p)})$.
3. Expected cost and reliability indices
Denoting $\gamma_j$ as the probability that the number of defective items found in the sample falls in $Z_0$ when the process is in state j, then for j = 0, 1 we have

$\gamma_j = \sum_{k \in Z_0} C_n^k\, p_j^k (1 - p_j)^{n-k}$
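The control limits and the probabilities γ0, γ1 can be computed directly. A small sketch follows, assuming the chart is centred on the in-control rate p0 (the paper writes the limits with an unspecified p) and using the example values n = 50, δ = 1, p0 = 0.03, p1 = 0.08:

```python
from math import comb, sqrt

def np_chart_gammas(n=50, delta=1.0, p0=0.03, p1=0.08):
    # Control limits, assuming the chart is centred on the in-control
    # defective rate p0 (an assumption made for this sketch).
    ucl = n * p0 + delta * sqrt(n * p0 * (1 - p0))
    lcl = max(0.0, n * p0 - delta * sqrt(n * p0 * (1 - p0)))
    z0 = [k for k in range(n + 1) if lcl < k < ucl]   # in-control zone Z0
    def gamma(p):
        # Probability the defective count falls in Z0 at defective rate p.
        return sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in z0)
    return gamma(p0), gamma(p1)

print(np_chart_gammas())  # (gamma_0, gamma_1)
```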
According to the above-mentioned assumptions, in each cycle, one of the following two scenarios occurs (see
Figure 2); Scenario 1: having passed some in-control states and out-of-control states, the system fails within the
out-of-control state, and Scenario 2: having passed some in-control states and out-of-control states, the system
fails within the in-control state.
3.1 Expected time in the mth cycle
Denote $\lambda_{1m} = \lambda_1^{m-1}$, $\lambda_{2m} = \lambda_2^{m-1}$, $1/\lambda_m = 1/\lambda_{1m} + 1/\lambda_{2m}$, $\mu_{1m} = \mu_1^{m-1}$ and $\mu_{2m} = \mu_2^{m-1}$. From Figure 1, the expected time in the mth cycle is

$E(T_m) = E(T_{m0}) + p_{io} E(T_{m1}) + p_{if} E(T_{m2}) + E(T_{m3})$ (1)

where $p_{io} = \lambda_{1m}/\lambda_m$, $p_{if} = \lambda_{2m}/\lambda_m$ and the other terms are derived below.
3.1.1 Expected time E(Tm0)
Within the time interval $L_{m01}$, the expected operating time of the system within the in-control state is $1/\lambda_{1m}$. The expected number of samples taken within the in-control state is $e^{-\lambda_{1m}h}/(1 - e^{-\lambda_{1m}h})$, see Lam (1988b). The probability of the appearance of a false alarm, which triggers unnecessary investigation effort, is $(1-\gamma_0)$. Thus, for every sampling investigation the process remains idle for an average of $((1-\gamma_0)\tau_s + nt)$ time units. According to Duncan (1956), the density function of $\tau_{m0}$ in the mth cycle is

$f(\tau_{m0}) = \lambda_{1m} e^{-\lambda_{1m}(h - \tau_{m0})} / (1 - e^{-\lambda_{1m}h})$ (2)

Therefore, the expected time between the shift of the system to the out-of-control state and the first inspection thereafter in the mth cycle is

$E(\tau_{m0}) = (\lambda_{1m}h - 1 + e^{-\lambda_{1m}h}) / (\lambda_{1m}(1 - e^{-\lambda_{1m}h}))$ (3)

Suppose the process is in the out-of-control state before the kth inspection and an out-of-control alarm appears, resulting in an investigation of the alarm. The probability of this event is $(1-\gamma_1)\gamma_1^{k-1}$, and the time spent on inspections is $(k-1)h + knt$. As an investigation uncovers whether the out-of-control alarm is true, there is only one minor repair, taking time $1/\mu_{1m}$. Hence, the whole expected time on this event is

$\sum_{k=1}^{\infty} [(k-1)h + knt + \tau_s](1-\gamma_1)\gamma_1^{k-1} + 1/\mu_{1m} = (nt + h\gamma_1)/(1-\gamma_1) + \tau_s + 1/\mu_{1m}$

Therefore,

$L_{m01} = 1/\lambda_{1m} + ((1-\gamma_0)\tau_s + nt)\, e^{-\lambda_{1m}h}/(1 - e^{-\lambda_{1m}h}) + E(\tau_{m0}) + (nt + h\gamma_1)/(1-\gamma_1) + \tau_s + 1/\mu_{1m}$ (4)
In Figure 1, $T_{m01}$ consists of the following time intervals: the expected time in the in-control states, $1/\lambda_{1m}$; the expected time in the out-of-control states before the first inspection is carried out, $E(\tau_{m0})$; the expected time on the inspection and the investigation of the first assignable alarm after the system shifts to the out-of-control state, $nt + \tau_s$; and the expected time on a minor repair, $1/\mu_{1m}$. Therefore, the expected time is

$T_{m01} = 1/\lambda_{1m} + E(\tau_{m0}) + nt + \tau_s + 1/\mu_{1m}$ (5)

However, all of the above events occur only if no failure occurs within the time interval $T_{m01}$; the probability of this is $e^{-\lambda_{2m}T_{m01}}$. Therefore, the expected time is

$E(T_{m0}) = \sum_{k=0}^{\infty} k L_{m01} e^{-k\lambda_{2m}T_{m01}} = L_{m01} e^{-\lambda_{2m}T_{m01}} / (1 - e^{-\lambda_{2m}T_{m01}})^2$ (6)
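As a sketch of how equations (3)-(6) combine, the function below returns E(Tm0) for given cycle rates; the argument names mirror the notation above and all inputs (including γ0, γ1, e.g. from the np_chart_gammas sketch) are assumed known.

```python
from math import exp

def expected_time_Tm0(lam1, lam2, mu1, h, n, t, tau_s, g0, g1):
    """Assemble E(T_m0) from equations (3)-(6)."""
    e_tau = (lam1 * h - 1 + exp(-lam1 * h)) / (lam1 * (1 - exp(-lam1 * h)))  # (3)
    L01 = (1 / lam1
           + ((1 - g0) * tau_s + n * t) * exp(-lam1 * h) / (1 - exp(-lam1 * h))
           + e_tau
           + (n * t + h * g1) / (1 - g1) + tau_s + 1 / mu1)                   # (4)
    T01 = 1 / lam1 + e_tau + n * t + tau_s + 1 / mu1                          # (5)
    return L01 * exp(-lam2 * T01) / (1 - exp(-lam2 * T01)) ** 2               # (6)
```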
3.1.2 Expected time E(Tm1)
Within the time interval $T_{m1}$, the expected time in the in-control state is $1/\lambda_m$. The expected number of samples taken during the in-control state is $e^{-\lambda_m h}/(1 - e^{-\lambda_m h})$. The probability of the appearance of a false alarm, which triggers unnecessary investigation effort, is $(1-\gamma_0)$; thus, for every sampling inspection the process remains idle for an average of $((1-\gamma_0)\tau_s + nt)$ time units. The expected time between the shift of the system to the out-of-control state and the first inspection thereafter in the mth cycle is

$E(\tau_{m1}) = (\lambda_m h - 1 + e^{-\lambda_m h}) / (\lambda_m(1 - e^{-\lambda_m h}))$ (7)

The probability of failure within $((k-1)h,\, kh)$ is $e^{-(k-1)h\lambda_{2m}} - e^{-kh\lambda_{2m}}$. The expected time between the first inspection after the shift of the process mean to the out-of-control state and the end of $T_{m1}$ is

$\sum_{k=1}^{\infty} [(k-1)h + knt]\,\gamma_1^{k-1}(e^{-(k-1)h\lambda_{2m}} - e^{-kh\lambda_{2m}}) = (h\gamma_1 e^{-h\lambda_{2m}} + nt)(1 - e^{-h\lambda_{2m}})/(1 - \gamma_1 e^{-h\lambda_{2m}})^2$

Hence, the total expected time in $T_{m1}$ is

$E(T_{m1}) = 1/\lambda_m + ((1-\gamma_0)\tau_s + nt)\, e^{-\lambda_m h}/(1 - e^{-\lambda_m h}) + E(\tau_{m1}) + (h\gamma_1 e^{-h\lambda_{2m}} + nt)(1 - e^{-h\lambda_{2m}})/(1 - \gamma_1 e^{-h\lambda_{2m}})^2$ (8)
3.1.3 Expected time E(Tm2)
Within the time interval (0, t), the probability that the system fails without any shift from the in-control state to the out-of-control state is $e^{-\lambda_{1m}t}(1 - e^{-\lambda_{2m}t})$. Therefore, the expected operating time from the first start of the system to failure without any shift to the out-of-control state in the mth cycle is $-\int_0^{\infty} t\, d\,[e^{-\lambda_{1m}t}(1 - e^{-\lambda_{2m}t})] = 1/\lambda_{1m} - 1/\lambda_m$. The probability of the appearance of a false alarm, which triggers unnecessary investigation effort, is $(1-\gamma_0)$; thus, for every sampling inspection the process remains idle for an average of $((1-\gamma_0)\tau_s + nt)$ time units. The expected number of samples taken during the in-control state is $e^{-\lambda_{1m}h}/(1 - e^{-\lambda_{1m}h}) - e^{-\lambda_m h}/(1 - e^{-\lambda_m h})$, giving

$E(T_{m2}) = 1/\lambda_{1m} - 1/\lambda_m + ((1-\gamma_0)\tau_s + nt)\left(e^{-\lambda_{1m}h}/(1 - e^{-\lambda_{1m}h}) - e^{-\lambda_m h}/(1 - e^{-\lambda_m h})\right)$ (9)

3.1.4 Expected time E(Tm3)
$T_{m3}$ is the time required to repair the failure in the mth cycle. The expected time is

$E(T_{m3}) = 1/\mu_{2m}$ (10)

3.2 Expected cost in the mth cycle
The expected cost during the mth cycle is

$E(C_m) = E(C_{m0}) + p_{io} E(C_{m1}) + p_{if} E(C_{m2}) + E(C_{m3})$ (11)
where, E(Cm j) is the expected cost incurred during Tm j, where j = 0, 1, 2, 3. Several parts to the cost may be
considered: profit of the system, investigation cost for out-of-control alarms, cost for repair and cost for
inspecting samples.
3.2.1 Expected cost E(Cm0)
The expected profit earned in the in-control state is $c_p/\lambda_{1m}$. The expected profit reduction while operating within the out-of-control state is $E(\tau_{m0})(c_p - c_o) + (nt + h\gamma_1)(c_p - c_o)/(1-\gamma_1)$. The investigation cost for false out-of-control alarms within the in-control state is $((1-\gamma_0)c_s + ntc_i)\, e^{-\lambda_{1m}h}/(1 - e^{-\lambda_{1m}h})$, and the investigation cost for out-of-control alarms within the out-of-control state is $ntc_i/(1-\gamma_1) + c_s + c_{r1}/\mu_{1m}$. Therefore, the expected cost within $T_{m01}$ is

$C_{m01} = -c_p/\lambda_{1m} - E(\tau_{m0})(c_p - c_o) - (nt + h\gamma_1)(c_p - c_o)/(1-\gamma_1) + ((1-\gamma_0)c_s + ntc_i)\, e^{-\lambda_{1m}h}/(1 - e^{-\lambda_{1m}h}) + ntc_i/(1-\gamma_1) + c_s + c_{r1}/\mu_{1m}$ (12)

The expected cost within $T_{m0}$ is

$E(C_{m0}) = \sum_{k=0}^{\infty} k C_{m01} e^{-k\lambda_{2m}T_{m01}} = C_{m01} e^{-\lambda_{2m}T_{m01}} / (1 - e^{-\lambda_{2m}T_{m01}})^2$ (13)
3.2.2 Expected cost E(Cm1)
The expected profit earned in the in-control state is $c_p/\lambda_m$, and the expected profit reduction while operating within the out-of-control state is $E(\tau_{m1})(c_p - c_o) + (1 - e^{-h\lambda_{2m}})(nt + h\gamma_1 e^{-h\lambda_{2m}})(c_p - c_o)/(1 - \gamma_1 e^{-h\lambda_{2m}})^2$. The investigation cost for false alarms within the in-control state is $((1-\gamma_0)c_s + ntc_i)\, e^{-\lambda_m h}/(1 - e^{-\lambda_m h})$, and for alarms within the out-of-control state it is $(1 - e^{-h\lambda_{2m}})ntc_i/(1 - \gamma_1 e^{-h\lambda_{2m}})^2$. Therefore, the expected cost within $T_{m1}$ is

$E(C_{m1}) = -c_p/\lambda_m - E(\tau_{m1})(c_p - c_o) - (1 - e^{-h\lambda_{2m}})(nt + h\gamma_1 e^{-h\lambda_{2m}})(c_p - c_o)/(1 - \gamma_1 e^{-h\lambda_{2m}})^2 + ((1-\gamma_0)c_s + ntc_i)\, e^{-\lambda_m h}/(1 - e^{-\lambda_m h}) + (1 - e^{-h\lambda_{2m}})ntc_i/(1 - \gamma_1 e^{-h\lambda_{2m}})^2$ (14)
3.2.3 Expected cost E(Cm2)
The expected profit earned in the in-control state is $c_p/\lambda_{1m} - c_p/\lambda_m$. The investigation cost for false alarms within the in-control state is $((1-\gamma_0)c_s + ntc_i)(e^{-\lambda_{1m}h}/(1 - e^{-\lambda_{1m}h}) - e^{-\lambda_m h}/(1 - e^{-\lambda_m h}))$. Hence, the expected cost within the time interval $T_{m2}$ is

$E(C_{m2}) = ((1-\gamma_0)c_s + ntc_i)(e^{-\lambda_{1m}h}/(1 - e^{-\lambda_{1m}h}) - e^{-\lambda_m h}/(1 - e^{-\lambda_m h})) - (c_p/\lambda_{1m} - c_p/\lambda_m)$ (15)
3.2.4 Expected cost E(Cm3)
The expected time within $T_{m3}$ is given in equation (10). Therefore, the cost incurred within $T_{m3}$ is

$E(C_{m3}) = c_{r2}/\mu_{2m}$ (16)
Proposition 1. The average cost per time unit for the operation of the system until cycle M is

$E(M) = \sum_{m=1}^{M} E(C_m) \Big/ \sum_{m=1}^{M} E(T_m)$ (17)

To optimise the parameter settings of the np control chart, E(M) should be minimised for $M \to \infty$. Unfortunately, as equation (17) is complex, an explicit solution is hard to obtain. Section 5 presents some data experiments for further discussion of the optimisation.
4. Numerical experiments
4.1 A geometric process case
We take the geometric process as an example. The definition of the geometric process is as follows.
Definition 1. Given random variables ξ and ζ, ξ is stochastically greater (less) than ζ if $\Pr\{\xi > \nu\} \ge \Pr\{\zeta > \nu\}$ for all real ν, written $\xi \ge_{st} \zeta$ (or $\xi \le_{st} \zeta$).

According to Ross (1996), a stochastic process $\{\xi_i, i = 1, 2, \dots\}$ is stochastically increasing (decreasing) if $\xi_i \le_{st} (\ge_{st})\ \xi_{i+1}$ for all $i = 1, 2, \dots$

Definition 2 (Lam (1992)). A sequence of non-negative independent random variables $\{\xi_i, i = 1, 2, \dots\}$ is called a geometric process (GP) if, for some $\tau > 0$, the distribution function of $\xi_i$ is $F(\tau^{i-1} x)$ for $i = 1, 2, \dots$

From Definition 2 we can obtain:
• If $\tau > 1$, then $\{\xi_i\}$ is stochastically decreasing: $\xi_i >_{st} \xi_{i+1}$, $i = 1, 2, \dots$
• If $0 < \tau < 1$, then $\{\xi_i\}$ is stochastically increasing: $\xi_i <_{st} \xi_{i+1}$, $i = 1, 2, \dots$
• If $\tau = 1$, then $\{\xi_i\}$ is a renewal process.
A GP benefits from its simplicity in describing times between failures and times between repairs. Studies involving repair models, maintenance policies and reliability indices include Lam (1992), Lam & Zhang (2003), Zhang (2002) and Wu & Clements-Croome (2006). However, the GP suffers from an important limitation: it can only model systems with a monotonically increasing, decreasing or constant failure intensity, and real intensities are usually more complicated (e.g. bathtub curves). In such cases a single GP cannot model the whole life cycle of the system. A bathtub curve can be viewed as comprising three distinct periods: a burn-in failure period with decreasing intensity, an intrinsic failure period with constant intensity and a wear-out failure period with increasing intensity. A GP can only model the system within one of the three periods. Below, we define an extended GP for systems whose failure intensity exhibits a bathtub curve.
Definition 3 (Wu (1994)). A sequence of non-negative independent random variables $\{X_n;\ n = 1, 2, \dots\}$ is called an extended Poisson process (EPP) if, for some α, β ≥ 0 with α + β ≠ 0, a ≥ 1 and 0 < b ≤ 1, the cdf of $X_n$ is $G((\alpha a^{n-1} + \beta b^{n-1})x)$, where G(x) is an exponential cdf. α, β, a and b are parameters of the process.
1) If a = b = 1, then the EPP is an HPP.
2) If $\alpha a^{n-1} \neq 0$ and $\beta b^{n-1} = 0$ (or $\alpha a^{n-1} = 0$ and $\beta b^{n-1} \neq 0$) for n = 1, 2, …, then $\{X_n\}$ is a GP.
3) If $\alpha a^{n-1} \neq 0$, a > 1 and b = 1, then $\{X_n\}$ can describe the periods from the intrinsic failure period to the wear-out period of a bathtub curve.
4) If a = 1, b < 1 and $\beta b^{n-1} \neq 0$, then $\{X_n\}$ can describe the periods from the burn-in period to the end of the intrinsic failure period of a bathtub curve.
5) If $\alpha a^{n-1} \neq 0$, a > 1, 0 < b < 1 and $\beta b^{n-1} \neq 0$, then $\{X_n\}$ can describe more complicated failure intensity curves.
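A one-line helper illustrates the construction of Definition 3; the demonstration values below are assumptions, with β = 0 recovering a plain GP (case 2).

```python
def epp_rate(base, alpha, a, beta, b, m):
    """Rate scale of the m-th cycle: (alpha*a**(m-1) + beta*b**(m-1)) * base."""
    return (alpha * a ** (m - 1) + beta * b ** (m - 1)) * base

# With beta = 0 the sequence is a plain GP; both terms non-zero gives the
# more general shapes of case 5 in Definition 3. Values are assumptions.
for m in range(1, 6):
    print(m, epp_rate(0.01, alpha=1.0, a=1.1, beta=0.0, b=1.0, m=m))
```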
Assume that $X_{1m}$, $X_{2m}$, $Y_{1m}$ and $Y_{2m}$ are exponentially distributed with means $1/((\alpha_1 a_1^{m-1} + \beta_1 b_1^{m-1})\lambda_1)$, $1/((\alpha_2 a_2^{m-1} + \beta_2 b_2^{m-1})\lambda_2)$, $1/((\alpha_3 a_3^{m-1} + \beta_3 b_3^{m-1})\mu_1)$ and $1/((\alpha_4 a_4^{m-1} + \beta_4 b_4^{m-1})\mu_2)$ respectively, where $a_j \ge 1$, $0 < b_j \le 1$ and j = 1, 2, 3, 4. Denote $\lambda'_{1m} = (\alpha_1 a_1^{m-1} + \beta_1 b_1^{m-1})\lambda_1$, $\lambda'_{2m} = (\alpha_2 a_2^{m-1} + \beta_2 b_2^{m-1})\lambda_2$, $\lambda'_m = \lambda'_{1m} + \lambda'_{2m}$, $\mu'_{1m} = (\alpha_3 a_3^{m-1} + \beta_3 b_3^{m-1})\mu_1$ and $\mu'_{2m} = (\alpha_4 a_4^{m-1} + \beta_4 b_4^{m-1})\mu_2$. By replacing $\lambda_{1m}$, $\lambda_{2m}$, $\lambda_m$, $\mu_{1m}$ and $\mu_{2m}$ with $\lambda'_{1m}$, $\lambda'_{2m}$, $\lambda'_m$, $\mu'_{1m}$ and $\mu'_{2m}$ in the previous equations, all of the aforementioned results can be extended. The parameters in Table 1 are used in this section.
cs = 0.9, co = 2, cp = 100, cr1 = 0.004, cr2 = 0.07, ci = 0.1, λ1 = 0.01, λ2 = 10, μ1 = 0.04, μ2 = 0.008, h = 5, t = 0, p0 = 0.03, p1 = 0.08, τs = 1
Table 1. Parameter settings
4.2 A comparison between the GP and EPP
When GP’s are used, we assume the parameter values shown in Table 2 and when extended GP’s are used, we
assume the parameter values shown in Table 3.
a1 = 1, b1 = 1.1, a2 = 1, b2 = 0.9, p1 = 0.03, p2 = 0.08, δ = 1, n = 50
Table 2. GP parameters
a1 = 1, a2 = 1.1, a3 = 1, a4 = 0; b1 = 0, b2 = 0.9, b3 = 0, b4 = 0.9; α1 = 1, α2 = 0.01, α3 = 1, α4 = 0; β1 = 0, β2 = 1, β3 = 0, β4 = 1; p1 = 0.03, p2 = 0.08; δ = 1; n = 50
Table 3. Extended GP parameters
Figure 1. Average cost E(M) under GP
Figure 2. E(M) under extended GP
Figure 1 shows that the cost per time unit changes monotonically when GPs are used to model the system. That may be oversimplified, because the cost per time unit may be lower while the system stays within the intrinsic failure period of the bathtub curve. When extended GPs are used, the average cost per time unit (see Figure 2) shows a different shape from that in Figure 1: the change in the average cost per unit time exhibits a bathtub-curve shape. This result may be more realistic than that based on GPs, as the average cost per time unit should be lower when the system is within the intrinsic failure period.
5. Parameter sensitivity analysis
5.1 Comparing different values of δ
δ is a parameter in the upper (lower) control limit of the np chart. We use the parameter values shown in Tables 1 and 3 (except δ) and adjust the value of δ. Letting δ = 1, 2 and 3, the average cost per time unit E(M) is shown in Figure 3 for M ≤ 50 and in Figure 4 for 70 ≤ M ≤ 100. The interval 50 < M < 70 is omitted, as the average cost differences there are too small for illustration. We can see from Figure 3 that when δ = 1, E(M) is larger than when δ = 3, whereas the opposite is observed (Figure 4) in the range 70 ≤ M ≤ 100. Therefore, we conclude that δ can be set to 3 when M ≤ 50 and to 1 when 70 ≤ M ≤ 100.
Figure 3. Relationship between E(M) and δ when M ≤ 50
Figure 4. Relationship between E(M) and δ when 70 ≤ M ≤ 100
5.2 Comparing different values of n
Another parameter in the upper (lower) control limit of the np chart is the sample size n. The parameter values given in Tables 1 and 3 are again used (except n), and n is adjusted in increments of 5 over the range 40-150. The effect on E(M) of the differing values of n is illustrated in Figure 5. From the figure it is clear that the cost is smallest when n = 55 and largest when n = 40; therefore, n = 55 is a better choice for the np control chart.
Figure 5. Cost changes under different n
6. Concluding remarks
Optimising the long-run average cost for monitoring manufacturing systems is an interesting research topic. A
typical assumption is that the system can be repaired to an ‘as good as new’ state. In reality, the assumption may
be violated when the system deteriorates over time. Research on the economic design of control charts for
deteriorating systems is therefore more useful.
This paper discusses the economic design of np control charts for systems that are not repaired ‘as good as new’. The main contributions of the paper are:
(1) GPs are applied to the economic design of control charts for the first time.
(2) The average cost per unit time until a given cycle is obtained.
We obtained the average cost per unit time for np control charts under the assumptions that the times between adjacent occurrences of assignable causes, the times to failure and the times to repair in each cycle follow GPs, and we found that the average cost per unit time increases monotonically. We then borrowed the extended GP, which can describe the entire lifecycle of a system, and applied it to optimise the average cost per unit time; the resulting average cost per unit time exhibits a bathtub curve. Experimental results show that the average cost per unit time of a given manufacturing system can vary considerably with the parameter values.
Condition evaluation of equipment in power plant based on grey theory
Jian-lan Li *, Shu-hong Huang
Huazhong University of Science & Technology, Wuhan, 430074, China
[email protected]
Abstract: Condition evaluation of equipment is important for reliability centered maintenance (RCM) in power plants. Based on grey system theory, a new concept, the grey space relation, is proposed and defined in this paper. The grey space relation reflects the approximation degree of two sequences in both distance and shape, and is a quantitative index of the relation between sequences. A grey model for the condition evaluation of equipment in power plants, based on the grey space relation, is constructed. By calculating the grey space relation between the condition parameter sequence and the rating condition parameter sequence of the evaluated equipment, the approximation degree between the equipment’s condition and its rating condition is obtained; a quantitative evaluation of the equipment is thus realised in this model. Finally, the condition evaluation of a feed water pump is carried out and good identification results are obtained; the evaluated result can be used as support in making scientific maintenance decisions.
Keywords: Equipment in power plant, grey model, condition evaluation, grey space relation
1. Introduction
In the process of reliability centered maintenance (RCM) for equipment in power plants, a core step is to evaluate the equipment’s current condition and its operational risk, which provides evidence for maintenance decision-making. Recently, some scholars have attempted to construct models of equipment condition evaluation using mathematical methods. Li et al. (2002) evaluated a steam turbine unit using simple weighting, and Gu et al. (2004) judged equipment condition by calculating the degree of membership of the equipment’s impairment grade. Both models achieve some success in evaluation, but the results are not good when some condition character parameters deviate severely from their rating values.
Grey theory is a subject founded by Professor J L Deng in 1982, which achieves the correct description and effective control of a system’s behaviour by extracting valuable information from an uncertain system with small samples and poor information. Grey theory has now been widely applied to the analysis, modelling, forecasting, decision-making and programming of social, economic, meteorological, ecological, water conservancy, industrial, agricultural and medical systems, with good effect; see Wang et al. (2005), Zhang & Liu (2006), Xu et al. (2006) and Chen & Li (2005) for details. The condition of equipment in a power plant is commonly described by condition parameters and has typical uncertainty characteristics. The essence of equipment condition evaluation is to compare the condition parameters with the rating parameters; a grey relation can therefore be used for this purpose. In this paper, a grey space relation, which reflects the approximation degree of sequences in both relative distance and geometric shape, is proposed and defined using grey theory, and a condition evaluation model for equipment in power plants is constructed. The model captures the effect of changes in condition parameters on an equipment’s condition by calculating the grey relation between the condition parameters and the rating parameters of the evaluated equipment; a quantitative evaluation of equipment in power plants is thus achieved. Finally, we show that the model can realise the condition evaluation of equipment in power plants and provide support for making scientific maintenance decisions.
The paper is organised as follows. In Section 2, three grey relations (the grey distance relation, the grey shape relation and the grey space relation) are defined and a model for the condition evaluation of equipment in power plants based on grey theory is constructed. In Section 3, the condition evaluation of a feed water pump based on the grey model is carried out. Section 4 concludes the paper.
2. Condition evaluation of equipment in power plant based on grey theory
2.1 Dimensionless condition character parameter of equipment
As the units of the parameters in the condition character parameter sequence differ, in order to evaluate the effect of each parameter consistently the parameters must be made dimensionless before calculating the grey space relation of equipment in a power plant. According to plant information and expert experience, an equipment’s condition deteriorates rapidly when the condition character parameters deviate from their rating values, especially for key parameters. Thus, how to express this relationship between parameter and condition exactly is very important in the dimensionless process. Li et al. (2002) and Gu et al. (2004) evaluate the equipment’s condition by calculating the impairment grade using a simple linear model, which obviously cannot reflect the exact relationship between parameter and condition. A new dimensionless model of condition character parameters is therefore constructed in this paper.
yi = e^(k(xi − x0)/(xT − xi))    (1)
In Eq. (1), yi is the dimensionless evaluation index of the condition character parameter, xi is the measured value of the parameter in practice, x0 is the rating value of the parameter, xT is its threshold value and k is the increasing factor. The larger k is, the more significant the effect of the parameter’s change on the equipment’s condition; k = 2 in this paper.
It can be seen from Eq. (1) that the relationship between the equipment’s condition and the deviation of the condition parameter is an exponential curve: the equipment’s condition deteriorates rapidly as the deviation of the measured value from the rating value increases.
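A minimal Python sketch of this transformation (an illustration only, not code from the paper; the exponent follows Eq. (1) as given above):

```python
import math

def dimensionless_index(x_i, x_0, x_T, k=2.0):
    """Dimensionless evaluation index of Eq. (1): y_i = exp(k*(x_i - x_0)/(x_T - x_i)).

    x_i: measured value, x_0: rating value, x_T: threshold value,
    k: increasing factor (k = 2 in the paper). The index grows rapidly
    as the measured value approaches the threshold.
    """
    return math.exp(k * (x_i - x_0) / (x_T - x_i))

# Example with the vibration parameter of Table 2 (rating 20 um, threshold 50 um):
print(dimensionless_index(35.0, 20.0, 50.0))  # moderate deviation -> moderate index
print(dimensionless_index(48.0, 20.0, 50.0))  # near the threshold -> very large index
```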
2.2 Index of equipment’s condition evaluation
A sequence X of n dimensions is constructed from condition character parameters x(1), x(2), …, x(n) that represent the equipment’s character function. Let the condition character parameters in the rating state form a standard sequence X0 = {x0(1), x0(2), …, x0(n)}, and let the evaluated condition character parameters form an evaluated sequence Xi = {xi(1), xi(2), …, xi(n)}. The essential aspect of equipment condition evaluation is to analyse the approximation degree between the operational values and the rating values of the equipment’s character parameters: the closer the two, the better the equipment’s condition. According to grey system theory, the condition evaluation of equipment can thus be translated into calculating the grey relation of the two sequences Xi and X0, where a larger relation indicates better equipment condition.
For an evaluated sequence Xi and standard sequence X0 of equipment in a power plant, the relation between the two sequences depends not only on the sequences’ approximation in geometric shape, but also on the distance between the sequences and on the effect (that is, the weight) of each parameter in the sequence. All of these factors are therefore considered in calculating the grey relation of the sequences.
Firstly, a real number γ(X0, Xi) is presented and defined in this paper as

γ(X0, Xi) = 1 / (1 + Σ_{k=1}^{n} α(k)|xi(k) − x0(k)|)    (2)

where α(k) is a weighting factor of the condition character parameter x(k).
γ(X0, Xi) satisfies the four axioms of grey relations, namely normalisation, entirety, even symmetry and approximation. As such, γ(X0, Xi) is a grey relation of Xi to X0, which is called the grey distance relation and is abbreviated as γ0i. The justification is as follows.
1) Normalisation: since 0 ≤ α(k)|xi(k) − x0(k)|, we have 0 < γ(X0, Xi) ≤ 1.
2) Entirety: if X = {Xs | s = 0, 1, 2, …, n}, then for any Xs1, Xs2 ∈ X, in general α(k)|xi(k) − xs1(k)| ≠ α(k)|xi(k) − xs2(k)|, so the relation depends on the whole set of sequences.
3) Even symmetry: if X = {X0, X1}, then α(k)|x1(k) − x0(k)| = α(k)|x0(k) − x1(k)|, so γ(X0, X1) = γ(X1, X0).
4) Approximation: the smaller |xi(k) − x0(k)| is, the greater γ(X0, Xi) is.
This concludes the justification.
γ0i reflects the approximation degree of the sequence Xi to X0 in distance by accumulating the absolute differences of the parameters in the two sequences, with the effect of each parameter embodied in the weight factors. The larger γ0i is, the closer Xi approaches X0 in distance. The value range of γ0i is (0, 1].
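A minimal sketch of Eq. (2) in Python (illustrative only; the function name is ours):

```python
def grey_distance_relation(x0, xi, alpha):
    """Grey distance relation gamma(X0, Xi) of Eq. (2):
    1 / (1 + sum_k alpha(k) * |xi(k) - x0(k)|).

    x0: standard (rating) sequence, xi: evaluated sequence,
    alpha: weighting factors of the condition character parameters.
    Larger values mean xi is closer to x0 in distance.
    """
    weighted = sum(a * abs(v - u) for a, v, u in zip(alpha, xi, x0))
    return 1.0 / (1.0 + weighted)
```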
In order to reflect the geometric approximation of the sequences Xi and X0, a grey shape relation ε(X0, Xi) is defined (Liu et al. (1999)), which is abbreviated as ε0i. We have

ε(X0, Xi) = (1 + |s0| + |si|) / (1 + |s0| + |si| + |si − s0|)    (3)

where, with X0^0 = (x0^0(1), …, x0^0(n)) and Xi^0 = (xi^0(1), …, xi^0(n)),

s0 = Σ_{k=2}^{n−1} x0^0(k) + x0^0(n)/2
si = Σ_{k=2}^{n−1} xi^0(k) + xi^0(n)/2
si − s0 = Σ_{k=2}^{n−1} (xi^0(k) − x0^0(k)) + (xi^0(n) − x0^0(n))/2
In Eq. (3), X0^0 and Xi^0 are the zero images of the initial points of X0 and Xi, respectively. ε0i reflects the approximation degree of the sequences Xi and X0 in geometric shape; it relates to shape only and is independent of distance, that is to say, translation has no effect on ε0i. The larger ε0i is, the closer Xi is to X0 in geometric shape. The value range of ε0i is (0, 1).
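A matching sketch of Eq. (3) (again illustrative; the zero image is formed by subtracting each sequence’s initial point):

```python
def grey_shape_relation(x0, xi):
    """Grey shape relation epsilon(X0, Xi) of Eq. (3), after Liu et al. (1999)."""
    z0 = [v - x0[0] for v in x0]  # zero image of X0
    zi = [v - xi[0] for v in xi]  # zero image of Xi
    s0 = sum(z0[1:-1]) + 0.5 * z0[-1]
    si = sum(zi[1:-1]) + 0.5 * zi[-1]
    diff = sum(b - a for a, b in zip(z0[1:-1], zi[1:-1])) + 0.5 * (zi[-1] - z0[-1])
    return (1 + abs(s0) + abs(si)) / (1 + abs(s0) + abs(si) + abs(diff))
```

For instance, grey_shape_relation([1, 2, 3, 4], [11, 12, 13, 14]) returns 1.0: the second sequence is a pure translation of the first, so the shapes are identical and the translation has no effect.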
In the evaluation process of equipment in a power plant, the grey distance relation reflects the approximation degree of the equipment’s evaluated condition to the standard condition on the whole, but it cannot exactly reflect the effect on the evaluation when some parameters, especially key parameters, deviate substantially from their rating values. For example, consider two cases of a feed water pump: in one, the vibration exceeds a dangerous value whilst the other parameters remain normal; in the other, no parameter exceeds a dangerous value but all exceed the rating value. If only distance is considered, the grey distance relations of the two cases may be the same, yet the feed water pump’s condition in the former case is obviously worse than in the latter. As such, only when both the distance and the shape of the sequences are considered can the approximation degree of the evaluated parameters to the rating parameters be truly reflected. Accordingly, based on Eqs. (1) and (2), a grey space relation ρ(X0, Xi), abbreviated as ρ0i, is proposed and defined in this paper (its justification is similar to that of γ0i). We have
ρ(X0, Xi) = θ·γ0i + (1 − θ)·ε0i    (4)
In Eq. (4), θ is the weighting factor of the grey distance relation within the grey space relation. As the equipment’s condition is mainly determined by the relative distance between the sequences, θ is commonly taken in the range [0.5, 1]; θ = 0.7 in this paper.
Eq. (4) shows that ρ0i is a weighted composite of the distance and shape relations: a quantitative index of the relation between the sequences. The larger ρ0i is, the closer Xi is to X0 and the better the equipment’s condition.
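Reusing the two sketches above, the composite of Eq. (4) is then a one-liner:

```python
def grey_space_relation(x0, xi, alpha, theta=0.7):
    """Grey space relation rho(X0, Xi) of Eq. (4): a theta-weighted composite
    of the distance relation and the shape relation (theta = 0.7 in the paper)."""
    return (theta * grey_distance_relation(x0, xi, alpha)
            + (1 - theta) * grey_shape_relation(x0, xi))
```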
2.3 Evaluation of equipment condition
Equipment condition can be divided into four grades (good, common, bad and faulty), as shown in Table 1.

Good: the equipment’s condition is good and it can continue operating.
Common: parameters deviate from the rating values; more inspections should be undertaken.
Bad: the equipment’s condition has deteriorated obviously; faults should be located immediately.
Faulty: the equipment is in a faulty condition and should be stopped for examination and repair immediately.

Table 1. Condition grades of power generation equipment
According to power plant rules, the character parameters of equipment in power plants have permissible operational ranges and rating operational values. As such, the grades of an equipment’s condition can be assigned using information from the character parameters. Three character parameter sequences X1, X2 and X3 are defined from the rating operational values and the permissible operational ranges:

X1: {x0i + (xTi − x0i)·0.3}
X2: {x0i + (xTi − x0i)·0.7}
X3: {xTi}    (5)

In Eq. (5), x0i and xTi are the rating value and the maximum permissible operating value of the equipment’s character parameter, respectively.
According to Eqs. (2)-(4), three grey space relations ρ1, ρ2 and ρ3 are calculated: the relations between the character parameter sequences X1, X2 and X3 and the rating parameter sequence, respectively. These three space relations are called boundary space relations and are the thresholds of the four condition grades (good/common, common/bad and bad/faulty):

good: ρ0i > ρ1
common: ρ1 ≥ ρ0i > ρ2
bad: ρ2 ≥ ρ0i > ρ3
faulty: ρ0i ≤ ρ3    (6)
In Eq. (6), ρ0i is the grey space relation of the sequence of factual operational values to that of the rating operational values.
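The boundary construction of Eq. (5) and the grading rule of Eq. (6) can be sketched as follows (illustrative; in the paper the sequences are first made dimensionless via Eq. (1) before the relations are computed):

```python
def boundary_sequences(x0, xT):
    """Boundary sequences X1, X2, X3 of Eq. (5), from rating values x0
    and maximum permissible operating values xT."""
    X1 = [u + (t - u) * 0.3 for u, t in zip(x0, xT)]
    X2 = [u + (t - u) * 0.7 for u, t in zip(x0, xT)]
    X3 = list(xT)
    return X1, X2, X3

def condition_grade(rho_0i, rho1, rho2, rho3):
    """Condition grade of Eq. (6), given the boundary space relations rho1 > rho2 > rho3."""
    if rho_0i > rho1:
        return "good"
    if rho_0i > rho2:
        return "common"
    if rho_0i > rho3:
        return "bad"
    return "faulty"
```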
3. Condition evaluation of feed water pump
An example of a feed water pump is taken for equipment evaluation in this paper. A character parameter
sequence of the feed water pump is acquired from Gu et al. (2004), of the vibration of feed water pump, export
pressure of feed water pump, bearing metal temperature of feed water pump, export oil temperature of oil cooler,
temperature of airproof cooling water, lubricating oil pressure of feed water pump, shown in table 2. Sequence
Y0 is a rating sequence that is made of rating operational values of the feed water pump and Y1, Y2 and Y3 are 3
sequences that are made of factual operational values of the feed water pump.
According to Eqs. (1)-(5), the three grey boundary space relations for the feed water pump are calculated as ρ1 = 0.684, ρ2 = 0.478 and ρ3 = 0.395, respectively. Similarly, the three grey space relations between the sequences Y1, Y2, Y3 and Y0 are calculated as ρ01 = 0.816, ρ02 = 0.367 and ρ03 = 0.547. Finally, the equipment condition evaluation results are obtained according to Eq. (6) and shown in Table 3. The diagnostic results of the model agree with those of the experts, which testifies to the exactness of the model, especially for condition Y2 (most of whose condition parameters are good, except the vibration, which is outside the permissible range): its grey space relation is ρ02 = 0.367 < ρ3, so Y2’s condition grade is judged faulty. This example shows that the equipment’s condition can be judged well even when some parameters deviate severely from their rating values.
Vibration of feed water pump (µm): permissible range 0~50, weight 0.2; Y0 = 20, Y1 = 20, Y2 = 52, Y3 = 35
Export pressure of feed water pump (MPa): permissible range >16, weight 0.2; Y0 = 16.7, Y1 = 16.5, Y2 = 16.7, Y3 = 16.7
Bearing metal temperature of feed water pump (℃): permissible range 60~70, weight 0.15; Y0 = 65, Y1 = 65, Y2 = 68, Y3 = 65
Export oil temperature of oil cooler (℃): permissible range 35~46, weight 0.15; Y0 = 40, Y1 = 40, Y2 = 40, Y3 = 42
Temperature of airproof cooling water (℃): permissible range 55~75, weight 0.15; Y0 = 65, Y1 = 65, Y2 = 65, Y3 = 70
Lubricating oil pressure of feed water pump (MPa): permissible range 0.1~0.24, weight 0.15; Y0 = 0.18, Y1 = 0.18, Y2 = 0.22, Y3 = 0.18
Table 2. Character parameters of the feed water pump
Y1: grey space relation 0.816; evaluation result by model: good; evaluation result by expert: good
Y2: grey space relation 0.367; evaluation result by model: faulty; evaluation result by expert: faulty
Y3: grey space relation 0.547; evaluation result by model: common; evaluation result by expert: common
Table 3. Condition evaluation results of the feed water pump
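As a check, feeding the grey space relations and boundary thresholds reported above into the condition_grade sketch of Eq. (6) reproduces Table 3:

```python
rho1, rho2, rho3 = 0.684, 0.478, 0.395  # boundary space relations from Section 3
for name, rho in [("Y1", 0.816), ("Y2", 0.367), ("Y3", 0.547)]:
    print(name, condition_grade(rho, rho1, rho2, rho3))
# Prints: Y1 good, Y2 faulty, Y3 common, matching the expert evaluations in Table 3.
```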
4. Conclusion
Based on grey system theory, the concepts of the grey distance relation and the grey space relation are proposed and their mathematical models defined. The grey distance relation is an index that reflects the approximation degree of the sequences Xi and X0 in distance, calculated from the absolute differences of the parameters in the two sequences. The grey space relation is a comprehensive quantitative index that reflects the approximation degree of the sequences Xi and X0 in both distance and shape.
A condition evaluation model for equipment in power plants, with a clear mathematical definition, is constructed in this paper. In this model, quantitative evaluation is achieved by calculating the grey space relation between the sequence of factual parameters and that of the rating parameters. In particular, the effect on the evaluation when some parameters deviate severely from their rating values is also addressed.
Three conditions of a feed water pump are evaluated using the model. The diagnostic results are consistent with those of the experts, which validates the model.
References
Chen, J. H., Sheng, D. R. and Li, W. (2002) A model of multi-objective comprehensive evaluation for power plant, Proceedings of the
Chinese Society for Electrical Engineering, 22 (12), 152- 155.
Chen, S. W. and Li, Z. G. (2005) Application of grey theory in oil monitoring for diesel engine, Transactions of CSICE, 23 (5), 476- 480.
Deng, J. L. (1985) Grey Systems, Beijing: National Defense Industry Press.
Gu, Y. J., Dong, Y. L. and Yang, K. (2004) Synthetic evaluation on conditions of equipment in power plant based on fuzzy judgment and
RCM analysis, Proceedings of the Chinese Society for Electrical Engineering, 24 (6), 189- 194.
Li, J., Sun, C. X. and Liao, R. J. (2004) Study on analysis method about fault diagnosis of transformer and degree of grey incidence
based on fuzzy clustering, Chinese Journal of Scientific Instrument, 25 (5), 587- 589.
Li, L. P., Zhang, X. L. and Wang, C. M. (2002) Theoretical and systematic study of the comprehensive evaluation of the operation state
of a large-sized steam turbine unit, J. of Engineering for Thermal Energy & Power, 17 (5), 442- 444.
Liu, S. F., Guo, T. B. and Dang, Y. G. (1999) Grey System Theory and Its Application, Beijing: Science Press.
Ren, S., Mu, D. J. and Zhu, L. B. (2006) Model of information security evaluation based on grey analytical hierarchy process, J. of
Computer Application, 26 (9), 2111- 2113.
Wang, H. Q., Wang, T. and Gu, Z. H. (2005) Grey prediction model and modification for city electric power demand, Proceedings of the
Chinese Society of Universities, 17 (2), 73- 75.
Xu, W. G., Tian, L. W. and Zhang, Q. Y. (2006) Study on modification and application of grey relation analysis model in evaluation of
atmospheric environmental quality, Environmental monitoring in China, 22 (3), 63- 66.
Zhang, C. H. and Liu, Z. G. (2006) Multivariable grey model and its application to prediction gas from boreholes, China Safety Science
Journal, 16 (6), 50- 54.