Proficiency testing in analytical laboratories: how to make it work

Accred Qual Assur (1996) 1:49–56
© Springer-Verlag 1996
R. Wood
M. Thompson
Received: 3 November 1995
Accepted: 20 November 1995
R. Wood
Food Labelling and Standards Division,
Ministry of Agriculture, Fisheries and
Food, CSL Food Science Laboratory,
Norwich Research Park, Colney, Norwich
NR4 7UQ, UK
M. Thompson
Department of Chemistry, Birkbeck
College (University of London), Gordon
House, 29 Gordon Square, London
WC1H 0PP, UK
REVIEW PAPER
Abstract This paper covers the role of proficiency testing schemes in providing an occasional but objective means of assessing and documenting the reliability of the data produced by a laboratory, and in encouraging the production of data that are “fit-for-purpose”. A number of aspects of proficiency testing are examined in order to highlight features critical for their successful implementation. Aspects that are considered are: accreditation, the economics and scope of proficiency testing schemes, methods of scoring, assigned values, the target value of standard deviation sp, the homogeneity of the distributed material, proficiency testing in relation to other quality assurance measures and whether proficiency testing is effective. Stress is placed on the importance of any proficiency testing scheme adhering to a protocol that is recognised, preferably internationally. It is also important that the results from the scheme are transparent to both the participating laboratory and its “customer”.
Key words Proficiency testing · Fit for purpose · Internal quality control · Harmonised international protocol
Introduction
It is now universally recognised that for a laboratory to
produce consistently reliable data it must implement an
appropriate programme of quality assurance measures.
Amongst such measures is the need for the laboratory
to demonstrate that its analytical systems are under statistical control, that it uses methods of analysis that are
validated, that its results are “fit-for-purpose” and that
it participates in proficiency testing schemes. These requirements may be summarised as follows.
Internal quality control
Internal quality control (IQC) is one of a number of
concerted measures that analytical chemists can take to
ensure that the data produced in the laboratory are under statistical control, i.e. of known quality and uncertainty.
In practice it is effected by comparing the quality of results achieved in the laboratory at a given time
with results from a standard of performance. IQC
therefore comprises the routine practical procedures
that enable the analytical chemist to accept a result or
group of results or to reject the results and repeat the
analysis. IQC is undertaken by the inclusion of particular reference materials, “control materials”, into the
analytical sequence and by duplicate analysis.
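As an illustration of the accept/reject decision that IQC supports, the following sketch applies simple Shewhart-type control limits to the result obtained on a control material in an analytical run. The limits (warning at ±2 SD, action at ±3 SD), the function name and the numerical values are conventional choices assumed here for illustration, not taken from this paper.

```python
def iqc_decision(control_result, control_mean, control_sd):
    """Accept or reject an analytical run from the result on a control material,
    using conventional Shewhart-type limits (an assumed, common convention:
    warning at +/- 2 SD, action/rejection at +/- 3 SD)."""
    deviation = abs(control_result - control_mean) / control_sd
    if deviation > 3:
        return "reject run and repeat the analysis"
    if deviation > 2:
        return "accept with caution; investigate if this recurs"
    return "accept run"

# Illustrative figures: control material with established mean 50.0 and SD 1.2
print(iqc_decision(53.9, 50.0, 1.2))   # deviation = 3.25 SD, so the run is rejected
print(iqc_decision(51.0, 50.0, 1.2))   # deviation = 0.83 SD, so the run is accepted
```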
Analytical methods
Analytical methods should be validated as fit for purpose before use by a laboratory. Laboratories should
ensure that, as a minimum, the methods they use are
fully documented, that laboratory staff are trained in their use and that a satisfactory IQC system has been implemented.
Proficiency testing
Proficiency testing is the use of results generated in interlaboratory test comparisons for the purpose of a
continuing assessment of the technical competence of
the participating laboratories [1]. With the advent of
“mutual recognition” on both a European and world
wide basis, it is now essential that laboratories participate in proficiency testing schemes that will provide an
interpretation and assessment of results which is transparent to the participating laboratory and its “customer”.
Participation in proficiency testing schemes provides
laboratories with an objective means of assessing and
documenting the reliability of the data they are producing. Although there are several types of proficiency
testing schemes, they all share a common feature: test
results obtained by one laboratory are compared to an
external standard, frequently the results obtained by
one or more other laboratories in the scheme. Laboratories wishing to demonstrate their proficiency should
seek and participate in proficiency testing schemes relevant to their area of work. However, proficiency testing
is only a snapshot of performance at infrequent intervals – it will not be an effective check on general performance or an inducement to achieve fitness for purpose, unless it is used in the context of a comprehensive
quality system in the laboratory.
The principles of proficiency testing are now well established and understood. Nevertheless, there are some
aspects of practice that need further amplification and
comment. This paper aims to highlight some of these.
Proficiency testing and accreditation
Despite the primary self-help objectives described
above, an acceptable performance in a proficiency testing scheme (where available) is increasingly expected
as a condition for accreditation. Indeed, in the latest revision of ISO Guide 25 it is a requirement that laboratories participate in appropriate proficiency testing
schemes whenever these are available [2]. Fortunately
both the accreditation requirements and the “self-help
intentions” can be fulfilled by the same means at one
and the same time.
Elements of proficiency testing
In analytical chemistry proficiency testing almost invariably takes the form of a simultaneous distribution
of effectively identical samples of a characterised material to the participants for unsupervised blind analysis
by a deadline. The primary purpose of proficiency testing is to allow participating laboratories to become
aware of unsuspected errors in their work and to take
remedial action. This it achieves by allowing a participant to make three comparisons of its performance:
with an externally determined standard of accuracy;
with that of peer laboratories; with its own past performance.
In addition to these general aims, a proficiency testing scheme should specifically address fitness for purpose, the degree to which the quality of the data produced by a participant laboratory can fulfil its intended
purpose. This is a critical issue in the design of proficiency testing schemes that will be discussed below.
History of proficiency testing: International Harmonised
Protocol
Proficiency testing emerged from the early generalised
interlaboratory testing that was used in different degrees to demonstrate proficiency (or rather lack of it),
to characterise analytical methods and to certify reference materials. These functions have now been separated to a large degree, although it is still recognised
that proficiency testing, in addition to its primary function, can sometimes be used to provide information on
the relative performance of different analytical methods for the same analyte, or to provide materials sufficiently well characterised for IQC purposes [3].
The systematic deployment of proficiency testing
was pioneered in the United States in the 1940s and in
the 1960s in the United Kingdom by the clinical biochemists, who clearly need reliable results within institutional units and comparability between institutions.
However, the use of proficiency testing is now represented in most sectors of analysis where public safety is
involved (e.g. in the clinical chemistry, food analysis, industrial hygiene and environmental analysis sectors)
and increasingly used in the industrial sector. Each of
these sectors has developed its own approach to the organisation and interpretation of proficiency testing
schemes, with any commonality of approach being adventitious rather than by collaboration.
To reduce differences in approach to the design and
interpretation of proficiency testing schemes the three
international organisations ISO, IUPAC and AOAC
INTERNATIONAL have collaborated to bring together the essential features of proficiency testing in the
form of The International Harmonised Protocol for the
Proficiency Testing of (Chemical) Analytical Laboratories [4, 5]. This protocol has now gained international
acceptance, most notably in the food sector. For the
food sector it is now accepted that proficiency testing
schemes must conform to the International Harmonised Protocol, and that has been endorsed as official
policy by the Codex Alimentarius Commission, AOAC
INTERNATIONAL and the European Union.
Studies on the effectiveness of proficiency testing
have not been carried out in a systematic manner in
most sectors of analytical chemistry, although recently
a major study of proficiency testing under the auspices
of the Valid Analytical Measurement (VAM) programme has been undertaken by the Laboratory of the
Government Chemist in the United Kingdom. However, the results have yet to be published (personal communication).
This paper comments on some critical aspects of
proficiency testing, identified as a result of experience
in applying the International Harmonised Protocol to
operational proficiency testing schemes.
Economics of proficiency testing schemes: requirement
for laboratories to undertake a range of
determinations offered within a proficiency testing
scheme
Proficiency testing is in principle adaptable to most
kinds of analysis and laboratories and to groups of laboratories of all sizes. However, it is most effectively
and economically applied to large groups of laboratories conducting large numbers of routine analyses. Setting up and running a scheme has a number of overhead costs which are best distributed over a large number of participant laboratories. Moreover, if only a
small range of activities is to be subject to test, then
proficiency testing can address all of them. If in a laboratory there is an extremely wide range of analyses that
it may be called upon to carry out (e.g. a food control
laboratory), it will not be possible to provide a proficiency test for each of them individually. In such a case
it is necessary to apply proficiency testing to a proportion of the analyses that can be regarded as representative. It has been suggested that for laboratories undertaking many different analyses a “generic” approach
should be taken wherever possible. Thus, general food analysis laboratories should participate in,
and achieve a satisfactory performance from, series
dealing with the testing of GC, HPLC, trace element
and proximate analysis procedures, rather than for every analyte that they may determine (always assuming
that an appropriate proficiency testing scheme is available). However, the basic participation should be supplemented by participation in specific areas where regulations are in force and where the analytical techniques applied are judged to be sufficiently specialised
to require an independent demonstration of competence. In the food sector examples of such analytes are
aflatoxins (and other mycotoxins), pesticides and overall and specific migration from packaging to food products.
However, it is necessary to treat with caution the inference that a laboratory that is successful in a particu-
lar proficiency scheme for a particular determination
will be proficient for all similar determinations. In a
number of instances it has been shown that a laboratory proficient in one type of analysis may not be proficient in a closely related one. Two examples of where
ability of laboratories to determine similar analytes is
very variable are described here.
Example 1: Total poly- and (cis) mono-unsaturated
and saturated fatty acids in oils and fats
Results from proficiency testing exercises that include
such tests indicate that the determinations are of variable quality. In particular, the determination of poly-unsaturated and saturated fatty acids is generally satisfactory but the determination of mono-unsaturated fatty
acids is unduly variable with a bi-modal distribution of
results sometimes being obtained. Bi-modality might be
expected on the grounds that some participant laboratories were able to separate cis- from trans- mono-unsaturated fatty acids. However, examination of the methods of analysis used by participants did not substantiate
this – some laboratories reported results as if they were
separating cis- and trans- fatty acids even though the
analytical systems employed were incapable of such a
separation. This is clearly demonstrated in Reports
from the UK Ministry of Agriculture, Fisheries and
Food’s Food Analysis Performance Assessment Scheme [6].
Example 2: Trace nutritional elements (zinc, iron,
calcium etc.)
Laboratories have been asked to analyse proficiency
test material which contains a number of trace elements
of nutritional significance, e.g. for zinc, calcium and
iron etc. It has been observed that the number of laboratories which achieve “satisfactory” results for each
analyte determined in the same test material differs
markedly, thus suggesting that the assumption that the
satisfactory determination of one such analyte indicates that a satisfactory determination would be observed for all similar analytes is not valid. This conclusion generally holds even if the elements are determined in a “difficult” matrix, such as a foodstuff, where many of the problems may be assigned to a matrix effect rather than to the end-point determination.
Other limitations are apparent in proficiency testing.
For example, unless the laboratory uses typical analytical conditions to deal with the proficiency testing materials (and this is essentially out of the control of the organiser in most schemes) the result will not enable participants to take remedial action in case of inaccuracy.
This gives rise to a potential conflict between the remedial and the accreditation roles of proficiency testing. It
is unfortunate that successful participation in proficiency testing schemes has become a “qualification” (or at
least poor performance a “disqualification”) factor in
accreditation. Nevertheless, it is recognised by most
proficiency testing scheme organisers that their primary
objective is to provide help and advice — not to “qualify” or “accredit” participants.
Finally, it must be remembered that extrapolation
from success in proficiency tests to proficiency in everyday analytical work is an assumption — in most circumstances it would be prohibitively expensive and practically difficult for a proficiency testing organiser to test
the proposition experimentally by using undisclosed
testing. However, most customers would anticipate that
performance in a proficiency testing exercise would be
the “best” that is achievable by a laboratory, and that
repeated poor performance in a proficiency testing
scheme is not acceptable.
Scoring
Converting the participant’s analytical results into
scores is nearly always an essential aid to the interpretation of the result. Those scores must be transparent to
both the laboratory and its “customer”; that customer
may be either a customer in the conventional sense or
an accreditation agency.
Raw analytical results are expressed in a number of
different units, cover a large range of concentrations
and stem from analyses that may need to be very accurate or may require only “order-of-magnitude” accuracy. An effective scoring system can reduce this diversity
to a single scale on which all results are largely comparable and which any analytical chemist or his client can
interpret immediately. Such a scoring system (the z-score) has been recommended in the International
Harmonised Protocol. A number of other scoring systems have evolved in the various proficiency testing
schemes which are presently operating; many of these
systems incorporate arbitrary scaling, the main function
of which is to avoid negative scores and fractions. However, all of these scores can be derived from two basic
types of score, the z-score and the q-score [4, 5].
The first action in converting a result into a score is to consider the error, the difference x - X̂, between the result x and the assigned value X̂ (X̂ being the best available estimate of the true value). This error can then be scaled by two different procedures:
q-scores
The q-score results by scaling the error to the assigned value, i.e. q = (x - X̂)/X̂. Values of q will be nearly zero-centred (in the absence of overall bias among the participants). However, the dispersion of q will vary among analytes, often by quite large amounts, and so needs further interpretation. Thus a “stranger” to the scheme would not be able to judge whether a score represented fitness for purpose — the scheme is not transparent.
z-scores
The z-score results by scaling the error to a target value for standard deviation, sp, i.e. z = (x - X̂)/sp. If the participating laboratories as a whole are producing data that are fit for purpose and are close to normally distributed (as is often the case), the z-score can be interpreted roughly as a standard normal deviate, i.e. it is zero-centred with a standard deviation of unity. Only relatively few scores (≈0.1%) would fall outside bounds of ±3 in “well-behaved” systems. Such bounds (normally ±3 or ±2) are used as decision limits for the instigation of remedial action by individual laboratories. The ±3 boundary has already been prescribed in the UK Aflatoxins in Nuts, Nut Products, Dried Figs and Dried Fig Products Regulations [7]. If participants as a whole were performing worse than the fitness-for-purpose specification, then a much larger proportion of the results would give z-scores outside the action limits. Because the error is scaled to the parameter sp, it is immediately interpretable by both the participating laboratory and its customers.
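For illustration, both scores follow directly from a result x, the assigned value X̂ and the target standard deviation sp. The sketch below is a minimal example; the numerical values are illustrative, and ±2/±3 are used as the decision limits in line with the bounds mentioned above.

```python
def q_score(x, x_assigned):
    """q-score: error scaled to the assigned value, q = (x - X_hat) / X_hat."""
    return (x - x_assigned) / x_assigned

def z_score(x, x_assigned, sigma_p):
    """z-score: error scaled to the target standard deviation, z = (x - X_hat) / sigma_p."""
    return (x - x_assigned) / sigma_p

# Illustrative values only: a result of 10.8 mg/kg against an assigned
# value of 10.0 mg/kg, with a fitness-for-purpose target sp of 0.5 mg/kg.
x, x_hat, sp = 10.8, 10.0, 0.5
z = z_score(x, x_hat, sp)
print(f"q = {q_score(x, x_hat):.3f}")   # q = 0.080
print(f"z = {z:.3f}")                   # z = 1.600
if abs(z) > 3:
    print("outside the +/-3 bound: remedial action indicated")
elif abs(z) > 2:
    print("outside the +/-2 bound: result questionable")
else:
    print("within the +/-2 bound: satisfactory")
```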
Combining scores
Many scheme organisers and participants like to summarise scores from different rounds or from various
analytes within a single round of a test as some kind of
an average; various possibilities are suggested in the International Harmonised Protocol. Such combinations
could be used within a laboratory or by a scheme organiser for review purposes. Although it is a valid procedure to combine scores for the same analyte within
or between rounds, it has to be remembered that combination scores can mask a proportion of moderate deviations from acceptability. Combining scores from different analytes is more difficult to justify. Such a combination could for instance hide the fact that the results
for a particular analyte were always unsatisfactory. Use
of such scores outside the analytical community might
therefore give rise to misleading interpretations. Thus, it must be emphasised that the individual score is the most informative, that it is the score which should be used for any internal or external “assessment” purposes, and that combination scores may, in some situations, disguise unsatisfactory individual scores.
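For review purposes a set of z-scores is sometimes summarised by statistics such as the rescaled sum of z-scores, RSZ = Σ zi / √m, or the sum of squared z-scores, SSZ = Σ zi². These are of the kind suggested in the International Harmonised Protocol, although the exact formulas quoted here should be checked against the protocol itself. The sketch below also illustrates the masking effect just mentioned: a single gross error contributes little to RSZ.

```python
import math

def rescaled_sum_of_z(z_scores):
    """RSZ = sum(z_i) / sqrt(m); approximately N(0, 1) if the z_i are."""
    return sum(z_scores) / math.sqrt(len(z_scores))

def sum_of_squared_z(z_scores):
    """SSZ = sum(z_i^2); approximately chi-squared with m degrees of freedom."""
    return sum(z * z for z in z_scores)

# Five analytes from one round: four acceptable results and one gross error.
zs = [0.3, -0.8, 0.5, -0.2, 3.6]
print(f"RSZ = {rescaled_sum_of_z(zs):.2f}")    # about 1.52: looks unremarkable
print(f"SSZ = {sum_of_squared_z(zs):.2f}")     # about 13.98: flags the problem more clearly
print(f"max |z| = {max(abs(z) for z in zs)}")  # the individual score remains the most informative
```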
The selection of assigned values, X̂
The method of determining the assigned value and its
uncertainty is critical to the success of the proficiency
testing scheme. An incorrect value will affect the scores
of all of the participants. For scheme co-ordinators the
ideal proficiency testing materials are certified reference materials (CRMs), because they already have assigned values and associated uncertainties. In addition,
participants cannot reasonably object to the assigned
value for the CRM as such values have normally been
derived by the careful (and expensive!) certification exercises, usually carried out on an international basis.
Unfortunately the use of CRMs for proficiency test material is relatively limited, as it is comparatively seldom
that an appropriate CRM can be obtained at sufficiently low cost for use in a proficiency testing exercise. In
addition, use in proficiency testing schemes is not the
primary objective in the preparation of CRMs. Considerable thought is therefore given to the validation of
materials specially prepared by organisers of proficiency testing schemes.
There are essentially three practical ways in which
the assigned value can be determined, these being:
Through test material formulation; from the consensus
mean from all participants; from the results from “expert laboratories”.
Test material formulation
An inexpensive and simple method is applicable and
available when the distributed material is prepared by
formulation, i.e. by mixing the pure analyte with a matrix containing none. The assigned value is then simply
calculated as the concentration or mass of analyte added to the matrix. The uncertainty can readily be estimated by consideration of the gravimetric and volumetric errors involved in the preparation, and is usually
small. A typical example where the technique applies is
the preparation of materials for alcoholic strength determinations where alcohol of known strength can be
added to an appropriate aqueous medium. However,
there are several factors which prevent this technique
being widely used. Unless the material is a true solution
it is difficult to distribute the analyte homogeneously in
the matrix. In addition, there is often a problem in obtaining the added analyte in the same chemical form as
the native analyte and, in many instances the nature of
the material itself would prevent its preparation by formulation. Generally, however, participants in the proficiency testing scheme have confidence in this method
of using the test material formulation to obtain the assigned value.
Consensus mean of results from all participants
The most frequently used procedure is probably that of
taking a consensus mean of all the participants. In
many sectors this is particularly appropriate where the
analyses under consideration are considered “simple”
or “routine”, and where the determination is well understood chemically or where there is a widely used
standard method. The procedure will apply specifically
where the method used is empirical or defining. In such
instances the consensus of the participants (usually a
robust mean of the results) is, by definition, the true
value within the limits of its uncertainty, which itself
can be estimated from a robust standard deviation of
the results.
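The robust mean and robust standard deviation referred to here can be computed in several ways. The sketch below uses one simple possibility (a median/MAD start followed by Huber-type winsorisation at ±1.5 robust SD); it is an illustrative estimator under those assumptions, not the specific algorithm prescribed by the Harmonised Protocol.

```python
import statistics

def robust_consensus(results, max_iter=20, tol=1e-6):
    """One possible robust mean/SD: start from the median and 1.4826 * MAD,
    then iterate with winsorisation at +/- 1.5 robust SD (Huber-type).
    Illustrative only; not the protocol's prescribed algorithm."""
    mu = statistics.median(results)
    mad = statistics.median(abs(x - mu) for x in results)
    s = 1.4826 * mad
    for _ in range(max_iter):
        clipped = [min(max(x, mu - 1.5 * s), mu + 1.5 * s) for x in results]
        new_mu = statistics.mean(clipped)
        new_s = 1.134 * statistics.stdev(clipped)  # 1.134 corrects for winsorisation at 1.5 SD
        converged = abs(new_mu - mu) < tol and abs(new_s - s) < tol
        mu, s = new_mu, new_s
        if converged:
            break
    return mu, s

# Illustrative round: most results cluster near 5.0, with two gross outliers.
results = [4.8, 5.1, 5.0, 4.9, 5.2, 5.0, 4.7, 5.3, 9.8, 0.4]
assigned, robust_sd = robust_consensus(results)
print(f"robust mean = {assigned:.2f}, robust SD = {robust_sd:.2f}")
```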
When several distinct empirical methods are in use
in a sector for the determination of the same nominal
analyte, it is important to recognise that they may well
provide results that are significantly different. This
would give rise to a problem in identifying a consensus
value from a mixture of results of the methods. Therefore it is sometimes useful to prescribe the particular
empirical method that is to be used for obtaining the
consensus. Examples of empirical determinations are
the determination of “extractable copper” in a soil or
the proximate analysis (moisture, fat, ash and crude
protein determinations) of a foodstuff.
Although participants usually have confidence in the consensus mean of the all-participants procedure, there are instances where it is much less appropriate to
use the consensus mean, e.g. when the analysis is regarded as difficult and where the analyte is present at
low trace levels. Under those circumstances it is by no means rare for the consensus to be significantly biased
or, in some instances, for there to be no real consensus
at all. Organisers should recognise that such an approach encourages consistency among the participants
but not necessarily trueness. It can easily institutionalise faulty methods and practices. An example of this is
given in Fig. 1, where the results are displayed as recommended in the Harmonised Protocol. However, the
mean has been calculated as a consensus mean for all
participants. As a result, the laboratories 7865, 2504
and 3581 appear to be different from other participants.
However, in this example, it is generally recognised that these laboratories are “expert” in the particular technique being assessed. In such situations the “expert laboratory” procedure should be adopted.
Results from “expert laboratories”
Should the above methods of obtaining an assigned value be inappropriate, the organiser may resort to the use
of analysis by expert laboratories using definitive or
standard methods. In principle one such laboratory
could be used if the participants felt confident in the
result. However, in many instances it would be better if
concurrent results from several expert laboratories
were used. This is obviously an expensive option unless
the experts are normal participants in the scheme. A
possible modification of this idea would be the use of
the consensus of the subset of the participants regarded
as completely satisfactory performers. Organisers using
this strategy must be vigilant to avoid the drift into a
biased consensus previously mentioned.
The uncertainty ua on the assigned value is an important statistic. It must not be too large in relation to
the target value for standard deviation (discussed below); otherwise the usefulness of the proficiency testing
is compromised. This problem occurs if the possible
variability of the assigned value becomes comparable
with the acceptable variability defined by sp. As a rule
of thumb ua must be less than 0.3 sp to avoid such difficulties [8].
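As an illustration of this rule of thumb, the standard uncertainty of a consensus assigned value can be approximated by the standard error of the robust mean and compared with 0.3 sp. The approximation ua = robust SD / √n used below is an assumption made for illustration, not the protocol's definition of ua.

```python
import math

def assigned_value_usable(robust_sd, n_participants, sigma_p):
    """Rule of thumb from the text: the uncertainty of the assigned value, u_a,
    should be less than 0.3 * sigma_p. Here u_a is approximated by the standard
    error of the consensus mean (an assumption, not the protocol's definition)."""
    u_a = robust_sd / math.sqrt(n_participants)
    return u_a, u_a < 0.3 * sigma_p

u_a, ok = assigned_value_usable(robust_sd=0.30, n_participants=40, sigma_p=0.25)
print(f"u_a = {u_a:.3f}; criterion u_a < {0.3 * 0.25:.3f} satisfied: {ok}")
# u_a = 0.047 < 0.075, so the consensus value would be usable for scoring
```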
The selection of the target value
of standard deviation, sp
This is another critical parameter of the scoring system.
Ideally the value of sp should represent a fitness for
purpose criterion, a range for interlaboratory variation
that is consistent with the intended use of the resultant
data. Hence a satisfactory performance by a participant
should result in a z-score within the range ±2 (although ±3 may be used in some situations). Obviously
the value of sp should be set by the organisers considering what is in fact fit for purpose, before the proficiency testing begins. This precedence is essential so
that the participants know in advance what standard of
performance is required. A number of proficiency testing schemes do not take this approach and use the value of sp generated from within the reported results for
any one distribution of test material. This is inappropriate, as by definition about 95% of participants will
automatically be identified as being satisfactory — even
if by any objective consideration they were not. The use
of an external standard of performance is therefore essential.
As the concentration of the analyte is unknown to
the participants at the time of analysis, it may be necessary to express the criterion as a function of concentration rather than a single value applicable over all concentrations. It is also important that the value of sp
used for an analysis should remain constant over extended periods of time, so that z-scores of both individual participants and groups of participants remain comparable over time.
As stressed above, the foregoing excludes the possibility of using the actual robust standard deviation of a
round of the test as the denominator in the calculation
of z-scores. It also excludes the use of criteria that
merely describe the current state of the art. Such practice would undoubtedly serve to identify outlying results but would not address fitness for purpose. It could
easily seem to justify results that were in fact not fit for
purpose. Moreover, it would not allow comparability of
scores over a period of time.
The question of how to quantify fitness for purpose
remains incompletely answered. A general approach
has been suggested based on the minimisation of cost
functions [9], but has yet to be applied to practical situations. Specific approaches based on professional judgements are used in various sectors. In the food industry the Horwitz function [10] is often taken as a fitness
for purpose (acceptability) criterion whereas in others,
e.g. in clinical biochemistry, criteria based on probabilities of false positives and negatives have evolved [11].
In some areas fitness for purpose may be determined
by statutory requirements, particularly where method
performance characteristics are prescribed, as by the
European Union [12] and the Codex Alimentarius
Commission for veterinary drug residues methods.
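The Horwitz function mentioned above predicts a reproducibility relative standard deviation from the concentration alone; one common statement of it is RSD_R(%) = 2^(1 - 0.5·log10 C), with C the concentration expressed as a dimensionless mass fraction. The sketch below uses that form to derive a candidate sp, on the assumption (made here for illustration) that the Horwitz prediction is adopted directly as the fitness-for-purpose criterion.

```python
import math

def horwitz_rsd_percent(mass_fraction):
    """Horwitz predicted reproducibility RSD (%) as a function of concentration:
    RSD_R = 2 ** (1 - 0.5 * log10(C)), with C a dimensionless mass fraction."""
    return 2 ** (1 - 0.5 * math.log10(mass_fraction))

def horwitz_sigma_p(concentration_mg_per_kg):
    """Candidate target standard deviation sp (in mg/kg) if the Horwitz value
    is adopted directly as the fitness-for-purpose criterion (an assumption)."""
    c = concentration_mg_per_kg * 1e-6   # convert mg/kg to mass fraction
    return concentration_mg_per_kg * horwitz_rsd_percent(c) / 100.0

# At 1 mg/kg the Horwitz function predicts RSD_R = 16 %, giving sp = 0.16 mg/kg
print(f"sp at 1 mg/kg  = {horwitz_sigma_p(1.0):.3f} mg/kg")
print(f"sp at 10 mg/kg = {horwitz_sigma_p(10.0):.3f} mg/kg")
```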
Homogeneity of the distributed material
As most chemical analysis is destructive, it is essentially
impracticable to circulate among the participants a single specimen as a proficiency testing material. The alternative is to distribute simultaneously to all participants samples of a characterised bulk material. For this
to be a successful strategy the bulk material must be
essentially homogeneous before the subdivision into
samples takes place. This is simple in the instance
where the material is a true solution. In many instances,
however, the distributed material is a complex multiphase substance that cannot be truly homogeneous
down to molecular levels. In such a case it is essential
that the samples are at least so similar that no perceptible differences between the participants’ results can be
attributed to the proficiency testing material. This condition is called “sufficient homogeneity”. If it is not demonstrated, the validity of the proficiency testing is questionable.
The International Harmonised Protocol recommends a method for establishing sufficient homogeneity. (More strictly speaking, the test merely fails to detect significant inhomogeneity.) After the
bulk material has been homogenised it is divided into
the test material for distribution. Ten or more of the
test materials are selected at random and analysed in
duplicate under randomised repeatability conditions by
a method of good precision and appropriate trueness.
The results are treated by analysis of variance and the
material is deemed to be sufficiently homogeneous, if
no significant variation between the analyses is found,
or if the between-sample standard deviation is less than
0.3 sp.
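For illustration, this check can be coded as a one-way analysis of variance on duplicate results from ten or more randomly selected samples of the distributed material, with acceptance when the between-sample standard deviation is below 0.3 sp, the criterion quoted above. The data and the simple acceptance rule below are illustrative; the protocol itself should be consulted for the full statistical detail.

```python
import statistics

def homogeneity_check(duplicates, sigma_p):
    """One-way ANOVA on duplicate results from >= 10 randomly chosen samples.
    `duplicates` is a list of (a, b) pairs. The material is taken as sufficiently
    homogeneous if the between-sample SD is below 0.3 * sigma_p (the criterion
    quoted in the text); the ANOVA algebra is standard for duplicates."""
    m = len(duplicates)
    means = [(a + b) / 2 for a, b in duplicates]
    # Within-sample (analytical) mean square from the duplicate differences
    ms_within = sum((a - b) ** 2 for a, b in duplicates) / (2 * m)
    # Between-sample mean square: 2 * variance of the sample means
    ms_between = 2 * statistics.variance(means)
    s_sam = max((ms_between - ms_within) / 2, 0.0) ** 0.5
    return s_sam, s_sam < 0.3 * sigma_p

# Illustrative data: 10 samples analysed in duplicate, with sigma_p = 0.5
dups = [(10.1, 10.0), (9.9, 10.2), (10.3, 10.1), (10.0, 9.8), (10.2, 10.2),
        (9.7, 10.0), (10.1, 10.3), (10.0, 10.1), (9.9, 9.8), (10.2, 10.0)]
s_sam, ok = homogeneity_check(dups, sigma_p=0.5)
print(f"between-sample SD = {s_sam:.3f}; sufficiently homogeneous: {ok}")
```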
There is a potential problem with the test for homogeneity — it may be expensive to execute because it
requires at least 20 replicate analyses. In the instance of
a very difficult analysis dependent on costly instrumentation and extraction procedures, e.g. the determination of dioxins, the cost of the homogeneity test may be
a major proportion of the total cost of the proficiency
test. Moreover, if the material is found to be unsatisfactory, the whole procedure of preparation and testing
has to be repeated. Some organisers are so confident of
their materials that they do not conduct a homogeneity
test. However, experience in some sectors has shown
that materials found to be satisfactory in some batches
are decidedly heterogeneous in other batches after the
same preparative procedures. Another complication of
such testing is that a single material may prove to be
acceptable for one analyte and heterogeneous for another. A possible strategy that could be used with care
is to store the random samples selected before distribution, but to analyse them only if the homogeneity of the
material is called into question after the results have
been examined. However, if heterogeneity were then detected, no remedial action would make that round of the proficiency testing usable, so the whole round would
have to be repeated to provide the proficiency information for the participants. In general, it seems that homogeneity tests are a necessary expense, unless the distributed material is a true solution that has been adequately mixed before subdivision.
Proficiency testing and other quality assurance
measures
While proficiency testing provides information for a
participant about the presence of unsuspected errors, it
is completely ineffectual unless the proficiency testing
is an integral part of the formal quality system of the
laboratory. For example, proficiency testing is not a
substitute for IQC, which should be conducted in every
run of analysis for the detection of failures of the analytical system in the short term to produce data that are
fit for purpose. It seems likely that the main way in
which proficiency testing benefits the participant laboratory is in compelling it to install an effective IQC system. This actually enhances the scope of the proficiency
testing scheme. The IQC system installed should cover
all analyses conducted in the laboratory, and not just
those covered by the proficiency testing scheme. In one
scheme it was shown that laboratories with better-designed IQC systems showed considerably better performance in proficiency testing [13]. A crucial feature
of a successful IQC scheme was found to be a control
material that was traceable outside the actual laboratory.
An important role of proficiency testing is the triggering of remedial action within a laboratory when unsatisfactory results are obtained. Where possible the
specific reason for the bad result should be determined
by reference to documentation. If consistently bad results are obtained, then the method used (or the execution of the method protocol) must be flawed or perhaps
applied outside the scope of its validation. Such interaction encourages the use of properly validated methods
and the maintenance of full records of analysis.
Does proficiency testing work?
There are two aspects of proficiency tests that need
consideration in judging their efficacy. These aspects
concern the “inliers” and the “outliers” among the results of the participants in a round. The inliers represent the laboratories that are performing consistently as
a group but may need to improve certain aspects of
their performance by attention to small details of the
method protocol. The outliers are laboratories that are
making gross errors perhaps by committing major deviations from a method protocol, by using an improperly validated method, by using a method outside the
scope of its validation, or other comparable faults.
The dispersion of the results of the inliers (as quantified by a robust standard deviation) in an effective
scheme would be expected at first to move round by
round towards the value of sp and then stabilise close
to that value. Ideally then, in a mature scheme the proportion of participants falling outside defined z-scores
should be roughly predictable from the normal distribution. Usually in practice in a new scheme the dispersion will be considerably greater than sp in the first
round but improve rapidly and consistently over the
subsequent few rounds. Then the rate of improvement
decreases towards zero.
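The claim that the proportion of participants falling outside given z-score bounds is roughly predictable from the normal distribution can be checked directly; for example, about 95% of a group performing exactly to the fitness-for-purpose criterion should score within ±2. A minimal sketch using only the standard library:

```python
from math import erf, sqrt

def expected_fraction_outside(bound):
    """Expected fraction of |z| > bound when z is a standard normal deviate."""
    cdf = 0.5 * (1 + erf(bound / sqrt(2)))   # standard normal CDF at `bound`
    return 2 * (1 - cdf)

print(f"expected fraction with |z| > 2: {expected_fraction_outside(2):.3f}")  # about 0.046
# i.e. roughly 95 % of a fit-for-purpose group should score within +/- 2
```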
If the incentives to perform well are not stringent,
the performance of the group of laboratories may stabilise at a level that does not meet the fitness for purpose requirement. Examples of this may be found in
some schemes where the proportion of participants obtaining satisfactory results in, say, the determination of
pesticides has increased over time but has now stabilised at ≈70% rather than the 95% which is the ultimate objective. However, where there are external constraints and considerations (e.g. accreditation) the proportion of outliers rapidly declines round by round
(discounting the effects of late newcomers to the scheme) as the results from the scheme markedly penalise
such participants. In addition, in view of the importance
of proficiency testing schemes to the accreditation process, the question of whether proficiency testing schemes should themselves become accredited or certified needs to be
addressed in the future.
Conclusion
Although it is now a formal requirement in many sectors that analytical laboratories participate in proficiency testing schemes, such participation is not always
without problems. Particular issues that proficiency
testing scheme organisers should address include the
selection of the procedure for determining the assigned
value and the use of an externally set standard for the target standard deviation. It is also important that they adhere to a protocol that is recognised, preferably internationally, and
that the results from the scheme are transparent to both
participant laboratory and its “customer”.
References
1. ISO Guide 43 (1993), 2nd Edition, Geneva
2. ISO Guide 25 (1993), 2nd Edition, Geneva
3. Thompson M, Wood R (1995) Pure Appl Chem 67:649–666
4. Thompson M, Wood R (1993) Pure Appl Chem 65:2123–2144
5. Thompson M, Wood R (1993) J AOAC International 76:926–940
6. Report 0805 of the MAFF Food Analysis Performance Assessment Scheme, FAPAS Secretariat, CSL Food Laboratory, Norwich, UK
7. Statutory Instrument 1992 No. 3326, HMSO, London
8. Statistics Sub-Committee of the AMC (1995) Analyst 120:2303–2308
9. Thompson M, Fearn T, Analyst (in press)
10. Horwitz W (1982) Anal Chem 54:67A–76A
11. IFCC approved recommendations on quality control in clinical chemistry, part 4: internal quality control (1980) J Clin Chem Clin Biochem 18:534–541
12. Official Journal of the European Union, No. L118 of 14.5.93, p 64
13. Thompson M, Lowthian PJ (1993) Analyst 118:1495–1500