Journal of Statistical Computation and Simulation
Vol. 78, No. 8, August 2008, 701–712
Some small-sample properties of some recently proposed
multivariate outlier detection techniques
RAND R. WILCOX*
Department of Psychology, University of Southern California, California, USA
(Received 29 September 2006; in final form 27 January 2007)
Recently, several new robust multivariate estimators of location and scatter have been proposed that
provide new and improved methods for detecting multivariate outliers. But for small sample sizes,
there are no results on how these new multivariate outlier detection techniques compare in terms
of pn , their outside rate per observation (the expected proportion of points declared outliers) under
normality. And there are no results comparing their ability to detect truly unusual points based on the
model that generated the data. Moreover, there are no results comparing these methods to two fairly
new techniques that do not rely on some robust covariance matrix. It is found that for an approach
based on the orthogonal Gnanadesikan–Kettenring estimator, pn can be very unsatisfactory with small
sample sizes, but a simple modification gives much more satisfactory results. Similar problems were
found when using the median ball algorithm, but a modification proved to be unsatisfactory. The
translated-biweights (TBS) estimator generally performs well with a sample size of n ≥ 20 and when
dealing with p-variate data where p ≤ 5. But with p = 8 it can be unsatisfactory, even with n = 200.
A projection method as well as the minimum generalized variance method generally perform best, but
with p ≤ 5, conditions where the TBS method is preferable are described. In terms of detecting truly
unusual points, the methods can differ substantially depending on where the outliers happen to be, the
number of outliers present, and the correlations among the variables.
Keywords: Robust methods; OGK estimator; TBS estimator; Median ball algorithm; Minimum
generalized variance technique; Projection methods
1. Introduction
In various settings, multivariate outlier detection methods play an important role [1, 2]. When
choosing a multivariate outlier detection technique, at least two fundamental properties are
of interest. The first is the outside rate per observation, which is the expected proportion
of outliers among a sample of size n, say pn . When sampling from a multivariate normal
distribution, generally it is desired to have pn reasonably small, say 0.05, and often methods
are ‘tuned’ to achieve this goal, at least when n is large [3]. With small to moderate sample
sizes, however, typically it is unclear how well this goal is achieved and, at least in some cases,
some adjustment is needed when n is small. A recent example in the univariate case can be
*Email: [email protected]
Journal of Statistical Computation and Simulation
ISSN 0094-9655 print/ISSN 1563-5163 online © 2008 Taylor & Francis
http://www.tandf.co.uk/journals
DOI: 10.1080/00949650701245041
found in Carling [4] who suggested a modification of the boxplot rule so that the outside rate
per observation is fairly stable as a function of n.
A second fundamental goal is to avoid masking. Roughly, a method is said to suffer from
masking if the very presence of outliers causes them to be missed. Let M be some multivariate
measure of location, based on data randomly sampled from some p-variate distribution, and
let C be some measure of scatter. If M is the usual sample mean and C the usual covariance
matrix, based on X_1, ..., X_n, then the classic approach is to use the Mahalanobis distance

D_i = \sqrt{(X_i - M)' C^{-1} (X_i - M)},   (1)

and declare X_i an outlier if D_i is sufficiently large. In particular, if the goal is to have p_n = α,
declare X_i an outlier if

D_i \ge \sqrt{\chi^2_{1-\alpha/2, p}},   (2)

the square root of the 1 - α/2 quantile of a chi-squared distribution with p degrees of freedom.
But it is well known that this method suffers from masking [1], roughly because the usual
sample mean and covariance matrix are not robust. That is, outliers can greatly influence their
values, which can cause Di to be small even when Xi is highly atypical.
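To make this concrete, here is a minimal R sketch of the classical rule; classic.out is a
hypothetical name, while mahalanobis and qchisq are base R functions:

# Flag points via equations (1) and (2): Mahalanobis distances based on
# the classical mean and covariance, compared with a chi-squared cutoff.
classic.out <- function(x, alpha = 0.05) {
  x <- as.matrix(x)
  p <- ncol(x)
  d2 <- mahalanobis(x, center = colMeans(x), cov = cov(x))  # squared D_i
  cut2 <- qchisq(1 - alpha / 2, df = p)                     # squared cutoff
  which(sqrt(d2) >= sqrt(cut2))                             # flagged indices
}

As just noted, this classical version is vulnerable to masking; the methods described below
replace the mean and covariance with robust counterparts.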
A seemingly natural approach to avoid masking is to take M and C to be some robust
measure of location and scatter in equation (1) and then use equation (2). Campbell [5]
proposed using a particular M-estimator, but the M-estimator used has a rather unsatisfactory
breakdown point, where the breakdown point of an estimator is the smallest proportion of
points that must be altered to make it arbitrarily large or small. The M-estimator used has
a breakdown point of only 1/(p + 1), meaning that masking can be a problem, particularly
as p gets large. Consequently, Rousseeuw and van Zomeren [3] suggest using the minimum
volume ellipsoid (MVE) estimator, which was introduced by Rousseeuw [6] and which is
discussed in detail by Rousseeuw and Leroy [1]. It seems that this method performs quite
well in terms of achieving pn ≈ 0.05 [2]. But concerns have been expressed by Olive [7]
as well as by Hawkins and Olive [8], and Fung [9] describes conditions where it can declare
too many points outliers. Rousseeuw and van Driessen [10] suggest replacing the MVE
estimator with the fast minimum covariance determinant (FMCD) estimator, but with small
to moderate sample sizes, pn becomes unstable and might exceed 0.05 by an unacceptable
amount [2].
There are at least three alternatives to the MVE and FMCD estimators that might be used
instead: the median ball algorithm (MBA) proposed by Olive [7], the so-called orthogonal
Gnanadesikan–Kettenring (OGK) estimator suggested by Maronna and Zamar [11], and the
translated-biweights (TBS) estimator derived by Rocke [12]. But it seems that nothing is
known about how these methods perform in terms of pn . One goal here is to report some
small-sample size results relevant to this issue. Another goal is to include results on two other
outlier detection methods that do not use Mahalanobis distance.
2. Description of the methods
The first portion of this section describes the measures of location and scatter that were
considered when using equations (1) and (2). Then the computational details for the other two
outlier detection methods are provided.
2.1 The OGK estimator
In its general form, the OGK estimator, derived by Maronna and Zamar [11], is applied as
follows. Let σ (X) and μ(X) be any measures of dispersion and location, respectively. The
method begins with the robust covariance between any two variables, say X and Y , which was
proposed by Gnanadesikan and Kettenring [13]:
cov(X, Y) = \frac{1}{4}\left(\sigma(X + Y)^2 - \sigma(X - Y)^2\right).   (3)
When σ (X) and μ(X) are the usual standard deviation and mean, respectively, the usual
covariance between X and Y results. Here, following Maronna and Zamar, σ (X) is taken
to be the tau scale of Yohai and Zamar [14]. Let
W_c(x) = \left(1 - \left(\frac{x}{c}\right)^2\right)^2 I(|x| \le c)

and

\rho_c(x) = \min(x^2, c^2),

where the indicator function I(|x| \le c) = 1 if |x| \le c and 0 otherwise. For the univariate
sample X_1, ..., X_n, let MAD(X) be the median of |X_1 - M_x|, ..., |X_n - M_x|, where M_x is
the usual median of X_1, ..., X_n, and let

\omega_i = W_{c_1}\left(\frac{X_i - M_x}{MAD(X)}\right).
Then the location and scale statistics are defined as

\mu(X) = \frac{\sum \omega_i X_i}{\sum \omega_i}

and

\sigma(X) = MAD(X) \sqrt{\frac{1}{n} \sum \rho_{c_2}\left(\frac{X_i - \mu(X)}{MAD(X)}\right)}.
Using this measure of scale in equation (3), the resulting measure of covariance will be
denoted by υ(X, Y ). Again following Maronna and Zamar, the choices for c1 and c2 are taken
to be 4.5 and 3, respectively.
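A minimal R sketch of these ingredients may help; tau.loc, tau.scale and gk.cov are
illustrative names, not functions from any package, and the formulas follow the definitions
above with c1 = 4.5 and c2 = 3:

tau.loc <- function(x, c1 = 4.5) {      # tau measure of location
  Mx <- median(x)
  madx <- median(abs(x - Mx))           # MAD(X), unscaled
  z <- (x - Mx) / madx
  w <- (1 - (z / c1)^2)^2 * (abs(z) <= c1)   # weights W_{c1}
  sum(w * x) / sum(w)
}
tau.scale <- function(x, c2 = 3) {      # tau measure of scale
  madx <- median(abs(x - median(x)))
  z <- (x - tau.loc(x)) / madx
  madx * sqrt(sum(pmin(z^2, c2^2)) / length(x))   # rho_{c2}(z) = min(z^2, c2^2)
}
gk.cov <- function(x, y)                # equation (3)
  (tau.scale(x + y)^2 - tau.scale(x - y)^2) / 4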
Following the notation in Maronna and Zamar [11], let xi be the ith row of the n × p
matrix X. Then Maronna and Zamar define a scatter matrix V(X) and a location vector t(X)
as follows:
1. Let D = diag(\sigma(X_1), ..., \sigma(X_p)) and y_i = D^{-1} x_i, i = 1, ..., n.
2. Compute U = (U_{jk}) by applying \upsilon to the columns of Y.
3. Compute the eigenvectors e_j of U and call E the matrix whose columns are the e_j's.
4. Let A = DE and z_i = A^{-1} x_i, in which case

   V(X) = A \Lambda A'

   and

   t(X) = A \nu,

   where \Lambda = diag(\sigma(Z_1)^2, ..., \sigma(Z_p)^2) and \nu = (\mu(Z_1), ..., \mu(Z_p))'.
704
R. R. Wilcox
Maronna and Zamar [11] note that the above procedure can be iterated and report results
suggesting that a single iteration be used. More precisely, compute V and t for Z (the matrix
corresponding to the z_i computed in step 4) and then express them in the original coordinate
system, namely, V_2(X) = A V(Z) A' and t_2(X) = A t(Z).
Maronna and Zamar show that the estimate can be improved by a reweighting step. Let

d_i = \sum_{j=1}^{p} \left(\frac{z_{ij} - \mu(Z_j)}{\sigma(Z_j)}\right)^2

and \omega_i = I(d_i \le d_0), where

d_0 = \frac{\chi^2_{p,\beta} \, med(d_1, ..., d_n)}{\chi^2_{p,0.5}},

where \chi^2_{p,\beta} is the β quantile of the chi-squared distribution with p degrees of freedom and
'med' denotes the median. Then the measure of location is now estimated to be

t_\omega = \frac{\sum \omega_i x_i}{\sum \omega_i}

and the measure of scatter is

V_\omega = \frac{\sum \omega_i (x_i - t_\omega)(x_i - t_\omega)'}{\sum \omega_i}.
Here, when using the OGK estimator, equation (2) was used to check for outliers. Results
reported by Maronna and Zamar [11] suggest using β = 0.9, but here it is found that this
can result in pn exceeding 0.05 by a considerable amount when n is small, and moreover,
pn is rather unstable as a function of n. A preliminary simulation was run with the goal of
possibly adjusting β so that pn is approximately 0.05 under multivariate normality. Based on
simulation results with n = 10, 20, 50, 100, 200 and p = 2(1) 10, the following choice for β
is suggested:
β = max(0.95, min(0.99, 1/n + 0.94))
and this choice will be used henceforth.
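Putting the pieces together, the following R sketch illustrates the OGK steps 1-4, one
reweighting step, and the adjusted β; ogk.est is a hypothetical name, the sketch reuses
tau.loc, tau.scale and gk.cov from section 2.1, it omits the optional second iteration
(V_2, t_2), and it is an illustration of the algorithm as described, not the implementation
used in the simulations of section 3:

ogk.est <- function(x, beta = NULL) {
  x <- as.matrix(x); n <- nrow(x); p <- ncol(x)
  if (is.null(beta)) beta <- max(0.95, min(0.99, 1/n + 0.94))
  D <- diag(apply(x, 2, tau.scale))              # step 1
  y <- x %*% solve(D)
  U <- matrix(0, p, p)                           # step 2
  for (j in 1:p) for (k in 1:p)
    U[j, k] <- if (j == k) tau.scale(y[, j])^2 else gk.cov(y[, j], y[, k])
  E <- eigen(U, symmetric = TRUE)$vectors        # step 3
  A <- D %*% E                                   # step 4
  z <- x %*% t(solve(A))
  nu <- apply(z, 2, tau.loc)
  sz <- apply(z, 2, tau.scale)
  d  <- rowSums(sweep(sweep(z, 2, nu), 2, sz, "/")^2)   # distances d_i
  d0 <- qchisq(beta, p) * median(d) / qchisq(0.5, p)    # cutoff d_0
  w  <- d <= d0                                  # hard-rejection weights
  tw <- colSums(x[w, , drop = FALSE]) / sum(w)
  Vw <- crossprod(sweep(x[w, , drop = FALSE], 2, tw)) / sum(w)
  list(center = tw, cov = Vw)
}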
2.2 The TBS estimator
Rocke [12] proposed what is called a TBS estimator. Generally, S-estimators of multivariate
location and scatter are values for θˆ and S that minimize |S|, the determinant of S, subject to
\frac{1}{n} \sum_{i=1}^{n} \xi\left(\left((X_i - \hat{\theta})' S^{-1} (X_i - \hat{\theta})\right)^{1/2}\right) = b_0,   (4)
where b0 is some constant, and ξ is a non-decreasing function, but Rocke [12] showed that
S-estimators can be sensitive to outliers even if the breakdown point is close to 0.5. He
suggested an alternative approach where the function ξ(d) is defined as follows. Let m and c
be values to be determined. When m ≤ d ≤ m + c,
\xi(d) = \frac{m^2}{2} - \frac{m^2(m^4 - 5m^2c^2 + 15c^4)}{30c^4}
       + d^2\left(\frac{1}{2} + \frac{m^4}{2c^4} - \frac{m^2}{c^2}\right)
       + d^3\left(\frac{4m}{3c^2} - \frac{4m^3}{3c^4}\right)
       + d^4\left(\frac{3m^2}{2c^4} - \frac{1}{2c^2}\right)
       - \frac{4md^5}{5c^4} + \frac{d^6}{6c^4},

for 0 \le d < m,

\xi(d) = \frac{d^2}{2},

and for d > m + c,

\xi(d) = \frac{m^2}{2} + \frac{c(5c + 16m)}{30}.

The values for m and c can be chosen to achieve the desired breakdown point and the asymptotic
rejection probability, roughly referring to the probability that a point will get zero weight when
the sample size is large. If the asymptotic rejection probability is to be γ, say, then m and c
are determined by

E_{\chi^2_p}(\xi(d)) = b_0

and

m + c = \sqrt{\chi^2_{p, 1-\gamma}}.
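A direct R transcription of ξ(d) may clarify the three pieces; xi.tbs is an illustrative name
(the tuning constant is called cc to avoid masking R's c function), and choosing m and cc to
satisfy the two conditions above is not shown:

xi.tbs <- function(d, m, cc) {
  # Rocke's translated-biweight function, evaluated piecewise (vectorized in d)
  ifelse(d < m, d^2 / 2,
  ifelse(d <= m + cc,
    m^2/2 - m^2*(m^4 - 5*m^2*cc^2 + 15*cc^4)/(30*cc^4) +
      d^2*(0.5 + m^4/(2*cc^4) - m^2/cc^2) +
      d^3*(4*m/(3*cc^2) - 4*m^3/(3*cc^4)) +
      d^4*(3*m^2/(2*cc^4) - 1/(2*cc^2)) -
      4*m*d^5/(5*cc^4) + d^6/(6*cc^4),
    m^2/2 + cc*(5*cc + 16*m)/30))
}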
An iterative estimation method is used to compute the measures of location and scatter [15],
which requires an initial estimate of location and scatter. Here this initial estimate is the MVE
estimator, which was computed with the S-PLUS function cov.mve, but some results on using
an alternative initial estimate are mentioned in section 3. As with the OGK estimator, when
using TBS, checks for outliers are based on equation (2).
2.3 Median ball algorithm
Following Olive [7], the reweighted MBA (RMBA) begins with two initial estimates of location
and scatter, both of which are refined by an iterative algorithm. The basic strategy is as
follows. For the jth estimator (j = 1, 2), let (T_{0,j}, C_{0,j}) be some starting value. Compute all n
Mahalanobis distances D_i(T_{0,j}, C_{0,j}) based on this measure of location and scatter. The next
iteration consists of computing the usual mean and covariance matrix based on the c_n ≈ n/2
cases corresponding to the smallest distances, yielding (T_{1,j}, C_{1,j}). Repeating this process
based on D_i(T_{1,j}, C_{1,j}) yields an updated measure of location and scatter, (T_{2,j}, C_{2,j}). As done
by Olive, (T_{5,j}, C_{5,j}) is used here. The first of the two starting values used by Olive takes
(T_{0,1}, C_{0,1}) to be the usual mean and covariance matrix. The other starting value, (T_{0,2}, C_{0,2}),
is the usual mean and covariance based on the c_n cases that are closest to the coordinatewise
median in Euclidean distance. Let (T_A, C_A) = (T_{5,i}, C_{5,i}), where i = 1 if the determinant
|C_{5,1}| \le |C_{5,2}|, and i = 2 otherwise. The RMBA estimator of location is T_A, and the measure of
scatter is

C_{RMBA} = \frac{MED(D_i^2(T_A, C_A))}{\chi^2_{p,0.5}} C_A.
Olive and Hawkins [16] show that the RMBA estimator is \sqrt{n} consistent. (The R function
RMBA, available at www.math.siu.edu/olive/rpack.txt, computes the RMBA estimate of
location and scatter and was used in the simulations described in section 3.)
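The concentration idea behind the algorithm is simple enough to sketch in a few lines of R;
conc.steps is an illustrative name, and Olive's RMBA function should be used in practice:

conc.steps <- function(x, T0, C0, iters = 5) {
  x <- as.matrix(x)
  cn <- ceiling(nrow(x) / 2)
  Tj <- T0; Cj <- C0
  for (k in 1:iters) {
    d2 <- mahalanobis(x, Tj, Cj)
    keep <- order(d2)[1:cn]      # the cn cases closest to the current center
    Tj <- colMeans(x[keep, ])    # reestimate location from those cases
    Cj <- cov(x[keep, ])         # and scatter
  }
  list(center = Tj, cov = Cj)
}

Running this from the two starting values described above and keeping the result with the
smaller determinant yields (T_A, C_A).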
It was found that if the Mahalanobis distance is computed using the RMBA estimator, and points
are declared outliers using equation (2) based on the 0.975 quantile, the outside rate per
observation is reasonably close to 0.05 under normality, provided that n/p ≥ 10, at least for
2 ≤ p ≤ 12. But otherwise
the outside rate per observation can be very unsatisfactory. For example, with n = 20 and
p = 5 it was estimated to exceed 0.24 regardless of the correlation among the variables. So
this approach is not as satisfactory as other methods considered when n is small, but the RMBA
measure of location does seem to be practical when used in conjunction with the minimum
generalized variance (MGV) method described next.
2.4 The MGV method
The next approach is based on what is called the MGV method. It has certain similarities to the
forward search method described by Atkinson et al. [17, p. 4], but it differs in ways that will
become clear. From basic multivariate techniques, the generalized variance is the determinant
of the usual covariance matrix; it reflects how tightly a cloud of points is clustered together.
The method in this section is based on the fact that the generalized variance is not robust; a
single unusual point can greatly inflate its value.
The MGV method is applied as follows.
1. Initially, all n points are described as belonging to set A.
2. Find the p points that are most centrally located. There are many ways this might be done
   and two are considered here. The first is based on

   d_i = \sum_{j=1}^{p} \sum_{\ell=1}^{n} \left(\frac{X_{\ell j} - X_{ij}}{MAD_j}\right)^2,   (5)

   where MAD_j is the value of the median absolute deviation (MAD) based on X_{1j}, ..., X_{nj}. The
   p most centrally located points are taken to be the points having the p smallest d_i values.
   The second and generally more satisfactory approach takes the p most centrally located
   points to be the p points having the smallest Mahalanobis distance based on the RMBA
   estimators, T_A and C_{RMBA}.
3. Remove the p centrally located points from set A and put them into set B. At this step, the
   generalized variance of the points in set B is zero.
4. If, among the points remaining in set A, the ith point is put in set B, the generalized variance
   of the points in set B will be changed to some value which is labeled s_{gi}^2. That is, associated
   with every point remaining in A is the value s_{gi}^2, which is the resulting generalized variance
   when it, and it only, is placed in set B. Compute s_{gi}^2 for every point in A.
5. Among the s_{gi}^2 values computed in the previous step, permanently remove the point
   associated with the smallest s_{gi}^2 value from set A and put it in set B. That is, find the point
   in set A which is most tightly clustered together with the points in set B. Once this point is
   identified, permanently remove it from A and leave it in B henceforth.
6. Repeat steps 4 and 5 until all points are in set B.
The first p points removed from set A have a generalized variance of zero, which is labeled
s_{g(1)}^2 = ... = s_{g(p)}^2 = 0. When the next point is removed from A and put into B (using steps 4
and 5), the resulting generalized variance of the set B is labeled s_{g(p+1)}^2, and continuing this
process, each point has associated with it some generalized variance when it is put into set B.
Based on the process just described, the ith point has associated with it one of the generalized
variances just computed. For convenience, this generalized variance associated with the ith
point, s_{g(j)}^2, is labeled C_i. The p deepest points have C values of zero. Points located at the
edges of a scatterplot have the highest C values, meaning that they are relatively far from the
center of the cloud of points. A strategy for detecting outliers is simply applying some good
univariate outlier rule to the C_i values. Note that a point would not be declared an outlier if
C_i is small, only if C_i is large.
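A direct, if inefficient, R sketch may make the steps concrete; mgv.depth is an illustrative
name, and center.dist would be either the d_i of equation (5) or the RMBA-based Mahalanobis
distances:

mgv.depth <- function(x, center.dist) {
  x <- as.matrix(x)
  n <- nrow(x); p <- ncol(x)
  Ci <- numeric(n)                    # the p seed points keep C = 0
  B <- order(center.dist)[1:p]        # steps 2-3: the p most central points
  A <- setdiff(1:n, B)
  while (length(A) > 0) {             # steps 4-6
    gv <- sapply(A, function(i) det(cov(x[c(B, i), ])))
    best <- A[which.min(gv)]
    Ci[best] <- min(gv)               # generalized variance when the point joins B
    B <- c(B, best)
    A <- setdiff(A, best)
  }
  Ci
}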
In terms of maintaining an outside rate per observation that is stable as a function of n
and p, and approximately equal to 0.05 under normality, a boxplot rule for detecting outliers
seems best when p = 2, and for p > 2 a slight generalization of Carling's [4] modification of
the boxplot rule appears to perform well. In particular, if p = 2, then declare the ith point an
outlier if

C_i > q_2 + 1.5(q_2 - q_1),   (6)

where q_1 and q_2 are the ideal fourths based on the C_i values. For p > 2 variables, replace
equation (6) with

C_i > M_C + \sqrt{\chi^2_{0.975, p}} (q_2 - q_1),   (7)

where \sqrt{\chi^2_{0.975, p}} is the square root of the 0.975 quantile of a chi-squared distribution with
p degrees of freedom and M_C is the usual median of the C_i values.
A possible criticism, when detecting outliers among the Ci values, is that the interquartile
range has a breakdown point of 0.25. Ideally, a univariate outlier detection method would have
a breakdown point of 0.5, the highest possible value. This can be achieved with a commonly
used MAD-median rule. When p = 2, for example, this means that the point X_i is declared an
outlier if

\frac{|C_i - M_C|}{MAD_C / 0.6745} > 2.24,   (8)

where MAD_C is the value of MAD based on the C values. A concern about this approach is
that the outside rate per observation is no longer stable as a function of n, and no method for
correcting this problem is available at this time.
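For reference, the two flagging rules can be sketched in R as follows; idealf, flag.C and
flag.C.mad are illustrative names, with idealf computing the ideal fourths via the standard
choices ell = [n/4 + 5/12] and h = n/4 + 5/12 - ell:

idealf <- function(x) {                 # ideal fourths of a sample
  n <- length(x); xs <- sort(x)
  ell <- floor(n/4 + 5/12)
  h <- n/4 + 5/12 - ell
  k <- n - ell + 1
  c(q1 = (1 - h)*xs[ell] + h*xs[ell + 1],
    q2 = (1 - h)*xs[k]   + h*xs[k - 1])
}
flag.C <- function(Ci, p) {             # equations (6) and (7)
  q <- idealf(Ci); MC <- median(Ci)
  if (p == 2) Ci > q[2] + 1.5*(q[2] - q[1])
  else        Ci > MC + sqrt(qchisq(0.975, p))*(q[2] - q[1])
}
flag.C.mad <- function(Ci) {            # equation (8), MAD-median rule
  MC <- median(Ci); madC <- median(abs(Ci - MC))
  abs(Ci - MC) / (madC / 0.6745) > 2.24
}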
For completeness, details of the forward search method suggested by Atkinson et al. [17] are
outlined in order to provide some sense of how it compares to the MGV method. The forward
search method begins with a subset of m points. Based on these m points, Mahalanobis distances
are computed for all n points, and the m + 1 points having the smallest distances
form the new subset. This is repeated until all n points are included. Then a plot is created [17, p. 7]
and simulations are used to determine the kind of fluctuations that are to be expected based
on this plot. The Mahalanobis distances are calculated for each sample and ordered. Another
approach is to plot the change in the Mahalanobis distances as each new point is added;
see figure 1.3 in Atkinson et al. So unlike MGV, ellipsoidal regions are used to determine
whether points are outliers. In some sense this approach might compete well with the methods
considered here, but due to the role of the plots that are used, including it in the simulations
used here is difficult at best.
2.5 A projection method
Consider any projection of the data onto a straight line. A projection-type method for detecting
outliers among multivariate data is based on the idea that if a point is an outlier, then it should
be an outlier for some projection of the n points. So if it were possible to consider all possible
projections, and if for some projection a point is an outlier, then the point is declared an outlier.
Not all projections can be considered, so following Wilcox [2], the strategy is to orthogonally
project the data onto all n lines formed by the center of the data cloud, as represented by ξˆ , and
each Xi . It seems natural that ξˆ should have a high breakdown point. Here, two choices for ξˆ
were considered. The first is a weighted mean, based in part on FMCD, which is computed by
the S-PLUS function cov.mcd. More precisely, Mahalanobis distances are computed using the
FMCD measures of location and scatter, any points flagged as outliers using equation (2) are
removed, and the measure of location is taken to be the mean of the remaining values. The
second measure of location considered was RMBA.
The computational details are as follows. Fix i, and for the point X_i, orthogonally project
all n points onto the line connecting \hat{\xi} and X_i, and let D_{ij} be the distance between \hat{\xi} and X_j
based on this projection. More formally, let

A_i = X_i - \hat{\xi},

B_j = X_j - \hat{\xi},

where both A_i and B_j are column vectors having length p, and let

C_j = \frac{A_i' B_j}{B_j' B_j} B_j,

j = 1, ..., n. Then when projecting the points onto the line between X_i and \hat{\xi}, the distance of
the jth point from \hat{\xi} is

D_{ij} = \|C_j\|,

where

\|C_j\| = \sqrt{C_{j1}^2 + \cdots + C_{jp}^2}.
Here, an extension of Carling's modification of the boxplot rule (similar to the modification
used by the MGV method) is used to check for outliers among the D_{ij} values. Let \ell = [n/4 +
5/12], where [\cdot] is the greatest integer function, and let

h = \frac{n}{4} + \frac{5}{12} - \ell.

For fixed i, let D_{i(1)} \le \cdots \le D_{i(n)} be the n distances written in ascending order. The ideal
fourths associated with the D_{ij} values are

q_1 = (1 - h) D_{i(\ell)} + h D_{i(\ell+1)}

and

q_2 = (1 - h) D_{i(k)} + h D_{i(k-1)},

where k = n - \ell + 1. Then the jth point is declared an outlier if

D_{ij} > M_D + \sqrt{\chi^2_{0.975, p}} (q_2 - q_1),   (9)

where M_D is the usual sample median based on the D_{i1}, ..., D_{in} values.
The process just described is for a single projection; for fixed i, points are projected
onto the line connecting X_i to \hat{\xi}. Repeating this process for each i, i = 1, ..., n, a point is
declared an outlier if, for any of these projections, it satisfies equation (9). This will be
called method OP, which has certain similarities to a projection method suggested by
Peña and Prieto [18]. One important difference is that the method used by Peña and Prieto
is based on the usual sample mean, which is not robust and which in turn could result in
masking.
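A compact R sketch of method OP follows, reusing the idealf helper sketched in section 2.4
and taking the robust center xi.hat as given; op.out is an illustrative name, the projection
follows the C_j formula above, and the sketch assumes no point coincides with xi.hat:

op.out <- function(x, xi.hat) {
  x <- as.matrix(x)
  n <- nrow(x); p <- ncol(x)
  flagged <- rep(FALSE, n)
  crit <- sqrt(qchisq(0.975, p))
  for (i in 1:n) {
    A <- x[i, ] - xi.hat                      # A_i
    D <- apply(x, 1, function(xj) {
      B <- xj - xi.hat                        # B_j
      Cj <- sum(A * B) / sum(B * B) * B       # projection C_j
      sqrt(sum(Cj^2))                         # D_ij = ||C_j||
    })
    q <- idealf(D)
    flagged <- flagged | (D > median(D) + crit * (q[2] - q[1]))  # equation (9)
  }
  which(flagged)
}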
As was the case with the MGV method, a simple and seemingly desirable modification of
the method just described is to replace the interquartile range with the MAD measure of scale
based on the values D_{i1}, ..., D_{in}. So here, MAD is the median of the values

|D_{i1} - M_D|, ..., |D_{in} - M_D|,

which is denoted by MAD_i. Then the jth point is declared an outlier if, for any i,

D_{ij} > M_D + \sqrt{\chi^2_{0.95, p}} \, \frac{MAD_i}{0.6745}.   (10)
Equation (10) represents an approximation of the method given by equation (1.3) in Donoho
and Gasko [19]. Again, an appealing feature of MAD is that it has a higher finite sample
breakdown point than the interquartile range. But a negative feature of equation (10) is that the
outside rate per observation appears to be less stable as a function of n. In the bivariate case, for
example, it is approximately 0.09 with n = 10 and drops below 0.02 as n increases. For the
same situations, the outside rate per observation using equation (9) ranges, approximately,
between 0.043 and 0.038.
3. Simulation results
Simulations were used to estimate pn under normality, and some additional simulations
were used as a partial check on the ability of the methods to detect truly unusual points.
All simulations were done in S-PLUS, which has a built-in function for computing the
TBS estimator. The OGK estimator was applied using software written by K. Konis, with
a minor error corrected by V. Todorov. The MBA was applied with R code written by
D. Olive, which was previously mentioned. Methods OP (using FMCD) and MGV were
applied with software in Wilcox [2]. When generating data from a multivariate normal distribution, a common correlation, ρ, was used, each method was applied, and the number
of points flagged as outliers was recorded. (All marginal distributions have mean 0 and
variance 1.) This process was replicated 1000 times, resulting in \hat{p}_n, the average proportion
of points declared outliers. Table 1 shows the results for n = 20 and ρ = 0 and 0.9.
(Results under MGV used the first measure of location mentioned in section 2.4. The column
MGV(RMBA) reports results when the MGV method uses the RMBA as the measure of
location.)
It is noted that some additional simulations were run where the correlation between variables
j and k was taken to be ρ_{jk} = jk/(p + 1)^2. The results were very similar to those for ρ = 0,
so for brevity they are not reported.
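The basic simulation loop is easy to reproduce; est.pn is an illustrative name, MASS::mvrnorm
generates the multivariate normal data, and detect stands in for any of the five methods (here
defaulting to the hypothetical classic.out sketched in section 1):

library(MASS)   # for mvrnorm
est.pn <- function(n, p, rho, nreps = 1000, detect = classic.out) {
  sigma <- matrix(rho, p, p); diag(sigma) <- 1   # common correlation rho
  mean(replicate(nreps, {
    x <- mvrnorm(n, mu = rep(0, p), Sigma = sigma)
    length(detect(x)) / n                        # proportion declared outliers
  }))
}
# e.g., est.pn(20, 2, 0.9) approximates one cell of table 1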
The goal is to have pn reasonably close to 0.05. If we view a method as unacceptable when
pˆ n exceeds 0.075, no method is acceptable among all of the conditions considered. This is
not surprising when p is large relative to n. For example, Rousseeuw and van Zomeren [3]
indicate that their method should be used only when n/p ≥ 5. All indications are that with
n/p ≥ 5, methods MGV and OP provide adequate control over pn , at least when p ≤ 12, but
when p = 2, methods OGK and TBS can be unsatisfactory when n = 20.
Table 1. Outside rate per observation.

 n     ρ     p     OGK     TBS     MGV     OP      MGV(RMBA)
 20    0.0   2     0.047   0.078   0.068   0.045   0.069
 20    0.9   2     0.081   0.078   0.068   0.049   0.072
 100   0.0   2     0.054   0.031   0.053   0.026   0.052
 100   0.9   2     0.093   0.005   0.052   0.019   0.053
 20    0.0   5     0.071   0.037   0.064   0.067   0.063
 20    0.9   5     0.062   0.038   0.068   0.011   0.065
 100   0.0   5     0.061   0.028   0.030   0.035   0.031
 100   0.9   5     0.062   0.027   0.033   0.005   0.031
 20    0.0   8     0.090   0.015   0.091   0.081   0.088
 20    0.9   8     0.069   0.032   0.046   0.064   0.091
 40    0.0   8     0.091   0.031   0.044   0.064   0.045
 40    0.9   8     0.075   0.062   0.053   0.003   0.044
 60    0.0   8     0.081   0.033   0.038   0.054   0.066
 60    0.9   8     0.068   0.069   0.046   0.003   0.039
 100   0.0   8     0.068   0.107   0.040   0.045   0.044
 100   0.9   8     0.061   0.109   0.046   0.002   0.064
A general concern about TBS is that as n increases, it is unclear that pn approaches 0.05.
And perhaps a more serious concern is that situations are encountered where pn increases
beyond 0.1 as n gets large. This is the case with p = 8 as indicated in table 1 and a similar
result was obtained with p = 12. The reason for this is unknown. As a partial check on
this result, the built-in S-PLUS function that computes Rocke’s TBS estimator was replaced
by S-PLUS code written by Prof. Rocke. But the same result was obtained. One feature
worth noting is that the algorithm relies on an initial estimate of location and scatter and
can be altered when using Rocke’s code. It was found that altering the initial estimate from
MVE to the weighted mean used by method MGV had a considerable impact on pn . But
regardless of which initial estimate was used, improved control over pn was not obtained, and
using the OGK estimator failed to correct this problem as well. Some additional simulations
were run with p = 5 and n = 200; there, pn is approximately 0.025. So there is no indication
that a similar problem occurs with p = 5. But caution seems warranted, and it would seem
that for p > 5, if the goal is to control the outside rate per observation, some other method
should be used.
Another issue has to do with the ability of a method to correctly identify points that are
truly unusual based on the model that generated the data. And a related issue is, as unusual
points are added to the data, what happens to the ability of a method to detect them? To
provide at least some results relevant to these issues, first consider the case p = 2, n = 20,
ρ = 0.9, and suppose an additional point is added at (1, −1). This point has Mahalanobis
distance 4.47, meaning that its distance from the origin is unusually large. Of interest is
the probability of a correct decision (PCD) regarding this point when applying the methods under study. The first line of table 2 reports the estimated PCD for this special case.
The second line reports the estimated PCD when there are two points added at (1, −1).
(So the notation (1, −1)2 indicates that two points were added at (1, −1).) Results are also
given when adding 3, 4, 5 and 6 points. For one or two points, method OGK and TBS
have a higher estimated PCD than methods MGV and OP. But as the number of outliers
is increased from four to five, method TBS performs in a relatively poor fashion, as does
method MGV. Note, however, that MGV(RMBA) performs relatively well. That is, in this
case, the choice for the measure of location used by the MGV method makes a practical difference.
With five outliers, method OP is best, with methods OGK and MGV(RMBA) performing
reasonably well.
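To indicate how such a check can be carried out, here is one plausible R sketch; est.pcd is an
illustrative name, it reuses classic.out and MASS from earlier sketches, and treating PCD as
the probability that all added points are flagged is an assumption, not a detail given in the text:

est.pcd <- function(n = 20, rho = 0.9, pt = c(1, -1), k = 1,
                    nreps = 1000, detect = classic.out) {
  p <- length(pt)
  sigma <- matrix(rho, p, p); diag(sigma) <- 1
  mean(replicate(nreps, {
    x <- rbind(MASS::mvrnorm(n, rep(0, p), sigma),
               matrix(pt, k, p, byrow = TRUE))   # the k added outliers
    all((n + 1):(n + k) %in% detect(x))          # were all added points flagged?
  }))
}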
To complicate matters, the location of an outlier can affect the ability of a method to detect it,
even when its Mahalanobis distance is extreme. Consider again the case where a single outlier
Table 2. Estimated PCD, n = 20, ρ = 0.9.

 Point                     OGK    TBS    MGV    OP     MGV(RMBA)
 p = 2
 (1, −1)1                  0.99   0.99   0.88   0.78   0.98
 (1, −1)2                  0.93   0.98   0.86   0.66   0.94
 (1, −1)3                  0.82   0.95   0.78   0.54   0.85
 (1, −1)4                  0.65   0.84   0.78   0.54   0.69
 (1, −1)5                  0.42   0.18   0.09   0.54   0.35
 (1, −1)6                  0.22   0.17   0.04   0.03   0.04
 (2.3, −2.3)1              0.18   0.28   0.21   0.26   0.17
 (2.3, −2.3)2              0.17   0.14   0.07   0.34   0.08
 p = 5
 (1, −1, 0, 0, 0)1         0.63   0.62   0.61   0.20   0.61
 (1, −1, 0, 0, 0)2         0.33   0.03   0.47   0.12   0.51
 (2.3, −2.3, 0, 0, 0)1     1.00   1.00   0.51   0.30   0.42
is placed at (1, −1), only now another outlier is placed at (2.3, −2.3). This second point has a
Mahalanobis distance of 10.23, which would seem to be exceptionally large. But the estimated
PCD for this second point is only 0.18, 0.28, 0.21, 0.26 and 0.17 for methods OGK, TBS,
MGV, OP and MGV(RMBA), respectively. With three outliers at (1, −1), the estimates are
now 0.17, 0.14, 0.07, 0.34 and 0.13. So now, method OP is best.
Finally, the bottom portion of table 2 shows some results when p = 5, there is a single
outlier at (2.3, −2.3, 0, 0, 0), and there are one or more outliers at (1, −1, 0, 0, 0). The entries
in table 2 are the estimated PCD when trying to detect the outlier at (1, −1, 0, 0, 0). Now
methods MGV and MGV(RMBA) perform as well as or better than the other three methods, and
with three outliers at (1, −1, 0, 0, 0), they offer a distinct advantage. But in terms of detecting
the outlier at (2.3, −2.3, 0, 0, 0), they do not compete well with methods OGK and TBS. For
example, the last line in table 2 shows the estimated PCD with three points at (1, −1, 0, 0, 0)
when the goal is to detect the outlier at (2.3, −2.3, 0, 0, 0). As indicated, the PCD is estimated
to be 0.51 using method MGV, versus 1.00 when using methods OGK and TBS.
To add perspective, when adding outliers to the data, the outside rate per observation among
the original n observations, generated from a multivariate normal distribution, was checked.
With p = 5 and n = 20 and two outliers at (1, −1, 0, 0, 0) and (2.3, −2.3, 0, 0, 0), the rates
for OGK, TBS, OP, MGV and MGV(RMBA) were 0.130, 0.229, 0.036, 0.058 and 0.028,
respectively. So both OGK and TBS are unsatisfactory; methods OP, MGV and MGV(RMBA)
are preferable. Increasing n to 30, again with two outliers, the rates are now 0.063, 0.071,
0.007, 0.019 and 0.021. With n = 100, all five methods have estimated rates less than 0.02,
with methods OP, MGV and MGV(RMBA) having estimates less than 0.01.
4. Concluding remarks
No single method dominates and no single method is always satisfactory in terms of controlling
pn , the outside rate per observation. But for routine use, methods OP, MGV and MGV(RMBA)
seem best based on the estimated pn values reported here. However, for p ≤ 5, a case can be
made for using TBS, the main exception being p = 2 and n = 20. The simple strategy of using
the OGK estimator with β fixed was found to be unsatisfactory. When β is adjusted as suggested
in section 2, its performance improves considerably, but MGV and OP seem preferable in
general.
In terms of detecting true outliers, the situation is complex. In particular, the ability of a
method to detect an outlier depends on where the outlier is located relative to the entire cloud of
points and it can depend on how many outliers there happen to be. In some situations, method
TBS is a clear winner, but situations arise where the reverse is true. So if p ≤ 5, it would seem
that TBS should be given serious consideration both in terms of pn and its ability to detect
true outliers. But with p > 5, all indications are that methods OP, MGV and MGV(RMBA)
are preferable, with MGV(RMBA) performing a bit better than MGV, and even when p ≤ 5,
they can have a higher PCD than TBS.
References
[1] Rousseeuw, P.J. and Leroy, A.M., 1987, Robust Regression & Outlier Detection (New York: Wiley).
[2] Wilcox, R.R., 2005, Introduction to Robust Estimation and Hypothesis Testing (2nd edn) (San Diego CA:
Academic Press).
[3] Rousseeuw, P.J. and van Zomeren, B.C., 1990, Unmasking multivariate outliers and leverage points. Journal of
the American Statistical Association, 85, 633–639.
[4] Carling, K., 2000, Resistant outlier rules and the non-Gaussian case. Computational Statistics & Data Analysis,
33, 249–258.
[5] Campbell, N.A., 1980, Robust procedures in multivariate analysis I: Robust covariance estimation. Applied
Statistics, 29, 231–237.
[6] Rousseeuw, P.J., 1985, Multivariate estimation with high breakdown point. In: W. Grossmann, G. Pflug and
W. Wertz (Eds) Mathematical Statistics and Applications, B. (Dordrecht: Reidel Publishing), pp. 283–297.
[7] Olive, D.J., 2004, A resistant estimator of multivariate location and dispersion. Computational Statistics & Data
Analysis, 46, 93–102.
[8] Hawkins, D.M. and Olive, D., 2002, Inconsistency of resampling algorithms for high-breakdown regression
estimators and a new algorithm. Journal of the American Statistical Association, 97, 136–147.
[9] Fung, W.-K., 1993, Unmasking outliers and leverage points: A confirmation. Journal of the American Statistical
Association, 88, 515–519.
[10] Rousseeuw, P.J. and Van Driessen, K., 1999, A fast algorithm for the minimum covariance determinant estimator.
Technometrics, 41, 212–223.
[11] Maronna, R.A. and Zamar, R.H., 2002, Robust estimates of location and dispersion for high-dimensional
datasets. Technometrics, 44, 307–317.
[12] Rocke, D.M., 1996, Robustness properties of S-estimators of multivariate location and shape in high dimension.
Annals of Statistics, 24, 1327–1345.
[13] Gnanadesikan, R. and Kettenring, J.R., 1972, Robust estimates, residuals and outlier detection with multiresponse data. Biometrics, 28, 81–124.
[14] Yohai, V.J. and Zamar, R.H., 1988, High breakdown-point estimates of regression by means of the minimization
of an efficient scale. Journal of the American Statistical Association, 83, 406–413.
[15] Rocke, D.M. and Woodruff, D.L., 1993, Computation of robust estimates of multivariate location and shape.
Statistica Neerlandica, 47, 27–42.
[16] Olive, D.J. and Hawkins, D.M., 2006, Robustifying robust estimators. Preprint available online at:
www.math.siu.edu/olive/preprints.htm.
[17] Atkinson, A.C., Riani, M. and Cerioli, A., 2004, Exploring Multivariate Data with the Forward Search
(New York: Springer).
[18] Peña, D. and Prieto, F.J., 2001, Multivariate outlier detection and robust covariance matrix estimation.
Technometrics, 43, 286–299.
[19] Donoho, D.L. and Gasko, M., 1992, Breakdown properties of the location estimates based on halfspace depth
and projected outlyingness. Annals of Statistics, 20, 1803–1827.
[20] Rocke, D.M. and Woodruff, D.L., 1996, Identification of outliers in multivariate data. Journal of the American
Statistical Association, 91, 1047–1061.