American Society for Quality
Characterizations of an Empirical Influence Function for Detecting Influential Cases in
Regression
Author(s): R. Dennis Cook and Sanford Weisberg
Source: Technometrics, Vol. 22, No. 4 (Nov., 1980), pp. 495-508
Published by: American Statistical Association and American Society for Quality
Stable URL: http://www.jstor.org/stable/1268187
TECHNOMETRICS ©, VOL. 22, NO. 4, NOVEMBER 1980
The following paper was presented at the 24th Annual Fall Technical Conference of the Chemical Division of the ASQC and the Section on Physical and Engineering Sciences of the ASA in Cincinnati, Ohio, October 23-24, 1980.
Characterizations of an Empirical Influence Function for Detecting Influential Cases in Regression
R. Dennis Cook and Sanford Weisberg
School of Statistics
University of Minnesota
St. Paul, MN 55108
Traditionally, most of the effort in fitting full rank linear regression models has centered on the study of the presence, strength and form of relationships between the measured variables. As is now well known, least squares regression computations can be strongly influenced by a few cases, and a fitted model may more accurately reflect unusual features of those cases than the overall relationships between the variables. It is of interest, therefore, for an analyst to be able to find influential cases and, based on them, make decisions concerning their usefulness in a problem at hand. Based on an empirical influence function, we discuss methodologies for assessing the influence of individual or groups of cases on a regression problem. We conclude with an example using data from the Florida Area Cumulus Experiments (FACE) on cloud seeding.
KEY WORDS
Linear models
Distance measures
Robustness
Residual plotting
Outlier tests
Cloud seeding
1. INTRODUCTION
The problems we consider arise in the context of the linear model

$$Y = X\beta + e \tag{1.1}$$

where $Y$ is an $n \times 1$ vector of responses, $X$ is an $n \times p$ matrix of known constants, $\beta$ is a $p \times 1$ parameter vector, and $e$ is an $n \times 1$ vector of errors. Data analyses based on this model usually center on the presence, form and strength of relationship between the response and independent variables (columns of $X$). Estimation, hypothesis testing, model selection and prediction are typical concerns.
Recently, interest in the role that each observation or case can play in an analysis has increased (here, a case refers to a response $y_i$ along with the associated design point, or row of $X$). A case may be judged influential if important features of the analysis are altered substantially when it is deleted from the data. For example, deletion of one or more cases can result in a substantial change in the center, orientation or volume of a confidence ellipsoid for unknown parameters. Individual or groups of cases can exert a substantial influence on the analysis and yet go undetected when the residuals are examined.

Received August 1979; revised April 1980
An important class of measures of influence can be based on empirical influence functions. Let $\hat\beta$ be the usual least squares estimator of $\beta$ based on the full data and let $\hat\beta_A$ be an alternative least squares estimator based on a subset of the data. Then, the empirical influence function for $\hat\beta$, $IF_A$, is defined to be

$$IF_A = \hat\beta_A - \hat\beta. \tag{1.2}$$

$IF_A$ is one of several empirical influence functions described by Mallows (1975); see also Jaeckel (1972). The values of $IF_A$ are comparable only within the given data set and model. Also, since $IF_A$ is a $p$-vector, routine use of it as a diagnostic for isolating influential cases may be labored. Alternatively, for a given positive (semi)definite matrix $M$ and a nonzero scale factor $c$, $IF_A$ can be characterized by the distance, $D_A(M, c)$, between $\hat\beta$ and $\hat\beta_A$ defined by

$$D_A(M, c) = \frac{(IF_A)^T M (IF_A)}{c}. \tag{1.3}$$

The matrix $M$ can be chosen to reflect specific interests.

In some applications, measurement of the influence of cases on the fitted values, $\hat{Y} = X\hat\beta$, may be more appropriate than measuring influence on $\hat\beta$. For example, if prediction is the primary goal it may be convenient to work with a reparameterized model where the regression coefficients are not of interest. The empirical influence function for $\hat{Y}$ is defined as $X(IF_A)$. Again, since $X(IF_A)$ is an $n$-vector, it is convenient to consider one-dimensional characterizations. Clearly, the distance measure $D_A(M, c)$ may be regarded as a norm of $X(IF_A)$ provided $M$ is chosen to be of the form $M = X^T B X$, where $B$ can be chosen to reflect specific interests.

In this paper, measures of the form (1.3) are studied for alternative estimators obtained by deleting a single case (Section 2), or by deleting several cases (Section 3), and also for problems where linear functions of $\beta$ are of interest (Section 4). Next we turn to measures based on volumes of confidence ellipsoids rather than on distances (centers); these are not directly interpretable in terms of the empirical influence functions. After a brief discussion of computations in Section 6, the methods of this paper are applied to an example using data from the Florida Area Cumulus Experiment (Woodley et al., 1977) on cloud seeding.

2. DELETING ONE CASE AT A TIME: DISTANCE MEASURES
Assuming the model (1.1) with $\mathrm{rank}(X) = p$ and $\mathrm{Cov}(e) = \sigma^2 I$, the least squares estimator of $\beta$ using the full data is $\hat\beta = (X^TX)^{-1}X^TY$ and the full sample estimate of $\sigma^2$ is $s^2 = Y^T(I - X(X^TX)^{-1}X^T)Y/(n - p)$. Let $r_i$ denote the $i$th residual, $r_i = y_i - x_i^T\hat\beta$, where $x_i^T$ is the $i$th row of $X$. Loosely speaking, cases with large $r_i$ have been considered as ones for which the model fails either due to incorrect functional form or because of an outlier in $Y$.

We shall need additional notation. A subscript "(i)" added to a quantity means "with the $i$th case deleted". Thus, for example, $X_{(i)}$ is an $(n - 1) \times p$ matrix derived from $X$ by deleting the $i$th row $x_i^T$, $\hat\beta_{(i)} = (X_{(i)}^TX_{(i)})^{-1}X_{(i)}^TY_{(i)}$, and so on. Also of importance is the projection matrix $V = (v_{ij}) = X(X^TX)^{-1}X^T$, an $n \times n$ rank $p$ matrix that projects onto the column space of $X$. The diagonal entries $v_{ii}$ are of special interest.

The empirical influence function (1.2) for $\hat\beta$ is given by

$$IF_i = \hat\beta_{(i)} - \hat\beta \tag{2.1}$$

and the distance measure (1.3) is given, for some $M$ and $c$, by

$$D_i(M, c) = \frac{(\hat\beta_{(i)} - \hat\beta)^T M (\hat\beta_{(i)} - \hat\beta)}{c}. \tag{2.2}$$

"Large" values of $D_i(M, c)$ would correspond to cases that, when deleted, result in large movement in the estimate of $\beta$. We shall call a case with a large value of $D_i(M, c)$ influential for estimating $\beta$ relative to $(M, c)$.

Natural choices for $M$ and $c$ are $X^TX$ and $ps^2$ respectively, although others are possible. The resulting statistic, suggested by Cook (1977), is

$$D_i(X^TX, ps^2) = \frac{(\hat\beta_{(i)} - \hat\beta)^T (X^TX) (\hat\beta_{(i)} - \hat\beta)}{ps^2}. \tag{2.3}$$
Form (2.3) has a useful geometric interpretation: The magnitude of the distance between $\hat\beta$ and $\hat\beta_{(i)}$ may be assessed by comparing $D_i(X^TX, ps^2)$ to the probability points of the central $F$ with $p$ and $n - p$ degrees of freedom. This is equivalent to studying the least squares confidence ellipsoids for $\beta$ based on the full data, and finding the ellipsoid that passes through $\hat\beta_{(i)}$; the $F$ distribution is used only to transform $D_i(X^TX, ps^2)$ to a more familiar scale. Also, (2.3) may be rewritten in the form (Bingham, 1977)

$$D_i(X^TX, ps^2) = \frac{(\hat{Y}_{(i)} - \hat{Y})^T(\hat{Y}_{(i)} - \hat{Y})}{ps^2} \tag{2.4}$$

with $\hat{Y}_{(i)} = X\hat\beta_{(i)}$, suggesting that $D_i(X^TX, ps^2)$ is, aside from the scale factor $ps^2$, the ordinary squared Euclidean distance that the fitted vector moves when the $i$th case is deleted from the data.
A computationally convenient and revealing form for $D_i(X^TX, ps^2)$ is (Cook, 1977)

$$D_i(X^TX, ps^2) = \frac{1}{p}\, t_i^2\, \frac{v_{ii}}{1 - v_{ii}} \tag{2.5}$$

where $t_i = r_i/(s(1 - v_{ii})^{1/2})$ is the $i$th Studentized residual. Thus, this distance measure is the product of a random residual term, $t_i^2$, and a fixed term, $v_{ii}/(1 - v_{ii})$. Assuming that the correct model is given by (1.1) and under normality, $F_i = t_i^2(n - p - 1)/(n - p - t_i^2)$ follows an $F$-distribution with 1 and $n - p - 1$ degrees of freedom. The importance and usefulness of the $t_i^2$'s and $v_{ii}$'s in an analysis have been studied by Srikantan (1961), Huber (1975), Cook (1977, 1979), Welsch and Kuh (1977), Hoaglin and Welsch (1978), Belsley, Kuh and Welsch (1980), and Weisberg (1980).
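As a minimal numerical sketch of the equivalence between the definition (2.3) and the reduced form (2.5), the following Python/NumPy code computes $D_i$ both ways. The function names and the synthetic data are assumptions made for illustration; they do not come from the paper.

```python
import numpy as np

def cooks_distance(X, y):
    """D_i from the reduced form (2.5), plus Studentized residuals and leverages."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    r = y - X @ beta                                # residuals r_i
    s2 = r @ r / (n - p)                            # full-sample estimate of sigma^2
    v = np.einsum("ij,jk,ik->i", X, XtX_inv, X)     # diagonal entries v_ii of V
    t = r / np.sqrt(s2 * (1.0 - v))                 # Studentized residuals t_i
    D = t**2 * v / (p * (1.0 - v))
    return D, t, v

def cooks_distance_by_deletion(X, y):
    """D_i from the definition (2.3), refitting with each case deleted."""
    n, p = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    r = y - X @ beta
    s2 = r @ r / (n - p)
    D = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        beta_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
        d = beta_i - beta
        D[i] = d @ (X.T @ X) @ d / (p * s2)
    return D

# Synthetic data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(20), rng.normal(size=(20, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=20)

D_fast, t, v = cooks_distance(X, y)
D_slow = cooks_distance_by_deletion(X, y)
```

The reduced form needs only one fit of the full data, whereas the deletion version refits the model $n$ times; the two arrays should agree up to rounding error.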
Clearly, $D_i(X^TX, ps^2)$ can be large if either $t_i^2$ or $v_{ii}$ is large. A case with a large value of $v_{ii}$ is called a high leverage case while a case with a large value of $t_i^2$ is called a (potential) outlier.

Characteristics of $x_i^T$ which cause $v_{ii}$ to be large can be seen as follows: Assuming that the intercept is included in the model, let $\mu_1 \le \mu_2 \le \cdots \le \mu_{p-1}$ denote the eigenvalues of the corrected cross product matrix for the data, and $p_1, \ldots, p_{p-1}$ denote the corresponding eigenvectors. Then, by the spectral decomposition of the corrected cross product matrix,

$$v_{ii} = \frac{1}{n} + \sum_{l=1}^{p-1} \frac{\left(p_l^T(x_i - \bar{x})\right)^2}{\mu_l} \tag{2.6}$$

where $\bar{x}$ is the vector of sample averages. Further, letting $\theta_{li}$ denote the angle between $p_l$ and $(x_i - \bar{x})$ we obtain

$$\cos(\theta_{li}) = \frac{p_l^T(x_i - \bar{x})}{\left((x_i - \bar{x})^T(x_i - \bar{x})\right)^{1/2}} \tag{2.7}$$

and

$$v_{ii} = \frac{1}{n} + (x_i - \bar{x})^T(x_i - \bar{x}) \sum_{l=1}^{p-1} \frac{\cos^2(\theta_{li})}{\mu_l}. \tag{2.8}$$

Thus, $v_{ii}$ is large if (1) $x_i$ is far from $\bar{x}$, that is, it is well removed from the bulk of the cases, and (2) $x_i$ is substantially in a direction of an eigenvector corresponding to a small eigenvalue of the corrected cross product matrix. On the other hand, if $(x_i - \bar{x})$ is small, $v_{ii}$ will be small regardless of its direction. Contours of constant $v_{ii}$ are ellipsoids centered at $\bar{x}$ with axes given by the eigenstructure of $(X^TX)^{-1}$.

The $v_{ii}$ also play an important role in the character of the usual test for outliers in linear regression. The usual likelihood ratio test, under a mean-shift model, of the hypothesis that the $i$th case is not an outlier, is a monotonic function of the $i$th Studentized residual and the power of this test is a decreasing function of $v_{ii}$ (Gentleman and Wilk, 1975; Cook, 1979). Thus the ability to detect outliers at unusual design points (cases with large $v_{ii}$) can be much less than the ability of detecting outliers for cases with smaller values of $v_{ii}$. Yet, it is precisely the points with the lowest potential power for the outlier test that are usually of the greatest interest. However, if the Bonferroni inequality is applied to the $F$ distribution to get significance levels for the outlier test, probability of error can be apportioned to cases unequally, giving smaller critical values to cases with larger values of $v_{ii}$. One rule of this type would choose the critical value for the $i$th case to correspond to the $\alpha_i = \alpha v_{ii}/p$ probability point of $F$ rather than the usual $\alpha_i = \alpha/n$ point used for all $i$. This unequal allocation of probability does not affect the underlying distribution of the test statistic as the $v_{ii}$ are assumed to be fixed, but in large samples, it may result in a substantial increase in power for the outlier test for cases with large $v_{ii}$.

Alternative Choices for M, c.
TABLE 1. $D_i(M, c) = (\hat\beta_{(i)} - \hat\beta)^T M (\hat\beta_{(i)} - \hat\beta)/c$, for various $M$, $c$.

M                           c            Reduced form                                                   Comments
X^T X                       p s^2        (1/p) t_i^2 v_ii/(1 - v_ii)                                    Cook (1977)
X^T X                       p s_(i)^2    (1/p) F_i v_ii/(1 - v_ii)                                      (DFFITS)^2, Welsch and Kuh (1977)
X_(i)^T X_(i)               p s^2        (1/p) t_i^2 v_ii
X_(i)^T X_(i)               p s_(i)^2    (1/p) F_i v_ii
[diag((X^T X)^{-1})]^{-1}   p s_(i)^2    (1/p) F_i x_i^T (X^T X)^{-1} M (X^T X)^{-1} x_i/(1 - v_ii)     (NDFBETAS)^2, Welsch and Kuh (1977)

Table 1 includes several alternative choices for $M$ and $c$. Although no detailed comparison of these alternatives has been made, it is likely that any choice for $M$, $c$ with $D_i(M, c)$ location/scale invariant would give approximately the same information. However, the geometric interpretation of the distance measure will vary with $M$, $c$. For example, the measure with $M = X_{(i)}^TX_{(i)}$ and $c = ps_{(i)}^2$ refers to the distance from $\hat\beta_{(i)}$ to $\hat\beta$ relative to the confidence ellipsoids computed without the $i$th case, and computed distances are not directly comparable as they refer to different metrics.

In the sequel, only the distance measure $D_i(X^TX, ps^2)$ will be discussed since the other reasonable choices for $(M, c)$ lead to essentially analogous statistics. For simplicity of notation, we shall write $D_i$ for $D_i(X^TX, ps^2)$.
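The decomposition of $v_{ii}$ in (2.6) to (2.8) can be checked numerically. The sketch below is illustrative only: it assumes that $x_i$ in those formulas denotes the predictors of case $i$ without the constant column, that the corrected cross product matrix is the centered predictor cross product, and it uses synthetic data and variable names that are not from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)           # synthetic data, for illustration only
n = 30
Z = rng.normal(size=(n, 3))              # predictors without the constant column
X = np.column_stack([np.ones(n), Z])     # full design matrix with intercept

# Leverages from the projection matrix V = X (X^T X)^{-1} X^T.
V = X @ np.linalg.solve(X.T @ X, X.T)
v = np.diag(V)

# Spectral decomposition of the corrected cross product matrix.
Zc = Z - Z.mean(axis=0)                  # rows are x_i - xbar
mu, P = np.linalg.eigh(Zc.T @ Zc)        # eigenvalues ascending, eigenvectors in columns
proj = Zc @ P                            # entry (i, l) is p_l^T (x_i - xbar)

# (2.6): leverage as 1/n plus weighted squared projections.
v_26 = 1.0 / n + np.sum(proj**2 / mu, axis=1)

# (2.7) and (2.8): squared distance from xbar times a sum of squared cosines.
dist2 = np.sum(Zc**2, axis=1)
cos2 = proj**2 / dist2[:, None]
v_28 = 1.0 / n + dist2 * np.sum(cos2 / mu, axis=1)
```

Both `v_26` and `v_28` should reproduce the diagonal of $V$, which makes the two sources of high leverage visible separately: the squared distance `dist2` and the alignment `cos2` with directions of small eigenvalue.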
3. GENERAL DISTANCE MEASURES FOR SEVERAL CASES
The one at a time statistics can be expected to provide the majority of the information needed to carry out the analysis. However, in some data sets subsets of cases can be jointly influential, but individually are uninfluential. Consider, for example, Figure 1. If point C or point D is deleted, the fitted model will change very little. If both are deleted, however, estimates of parameters may show large changes. Conversely, if A or B is deleted, the fitted line will change; if A and B are both deleted, the line will stay about the same.

FIGURE 1. A and B are individually influential, but not jointly. C and D are jointly influential but not individually.

Let $I$ be an $m$-vector of indices that specify the $m$ cases to be deleted. The subscript "(I)" will mean "with the $m$ cases indexed by $I$ deleted", while "I" without parentheses will mean with only the cases indexed by $I$ remaining. For example, $V_I$ is an $m \times m$ submatrix of $V$ formed by the intersection of the rows and columns indexed by $I$, and $r_I$ is the $m \times 1$ vector of residuals for cases indexed in $I$. The empirical influence function is

$$IF_I = \hat\beta_{(I)} - \hat\beta \tag{3.1}$$

and the distance function $D_I(X^TX, ps^2) = D_I$ is

$$D_I = \frac{(\hat\beta_{(I)} - \hat\beta)^T(X^TX)(\hat\beta_{(I)} - \hat\beta)}{ps^2}. \tag{3.2}$$

The geometric interpretation of $D_I$ is identical to that of $D_i$. An influential subset for estimating $\beta$ will correspond to a large $D_I$.

A convenient formula for $D_I$ can be derived from the result (Bingham, 1977)

$$\hat\beta_{(I)} - \hat\beta = -(X^TX)^{-1}X_I^T(I - V_I)^{-1}r_I$$

so that

$$D_I = \frac{r_I^T(I - V_I)^{-1}V_I(I - V_I)^{-1}r_I}{ps^2}. \tag{3.3}$$

Insight into $D_I$ can be obtained by applying the spectral decomposition to $V_I$. There is an $m \times m$ diagonal matrix $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_m)$ with $0 \le \lambda_1 \le \cdots \le \lambda_m \le 1$ and an orthogonal matrix $\Gamma$, such that

$$V_I = \Gamma^T\Lambda\Gamma. \tag{3.4}$$

For convenience, we will suppress all indication that $\Gamma$ and $\Lambda$ depend on $I$.

If $\lambda_m = 1$, then inverses in (3.3) do not exist. If the cases indexed by $I$ are removed, the resulting data are rank deficient, and a unique $\hat\beta_{(I)}$ does not exist. Therefore, if $\lambda_m = 1$, we set $D_I = \infty$. If $\lambda_m < 1$, (3.3) can be written as

$$D_I = \frac{r_I^T(\Gamma^T\Gamma - \Gamma^T\Lambda\Gamma)^{-1}\,\Gamma^T\Lambda\Gamma\,(\Gamma^T\Gamma - \Gamma^T\Lambda\Gamma)^{-1}r_I}{ps^2} = \frac{(\Gamma r_I)^T(I - \Lambda)^{-1}\Lambda(I - \Lambda)^{-1}(\Gamma r_I)}{ps^2}. \tag{3.5}$$

Letting $g = (g_l) = \Gamma r_I$, we obtain

$$D_I = \frac{g^T(I - \Lambda)^{-1}\Lambda(I - \Lambda)^{-1}g}{ps^2} = \frac{1}{ps^2}\sum_{l=1}^{m} g_l^2\,\frac{\lambda_l}{(1 - \lambda_l)^2}. \tag{3.6}$$

Each $g_l$ is a linear combination of the elements of $r_I$, and

$$\mathrm{Var}(g) = \mathrm{Var}(\Gamma r_I) = \sigma^2\,\Gamma\Gamma^T(I - \Lambda)\Gamma\Gamma^T = \sigma^2(I - \Lambda).$$

Thus, the $g_l$ are uncorrelated with $\mathrm{Var}(g_l) = \sigma^2(1 - \lambda_l)$. Further, if we let

$$h_l^2 = \frac{g_l^2}{s^2(1 - \lambda_l)} \tag{3.7}$$

then (3.6) may be rewritten

$$D_I = \frac{1}{p}\sum_{l=1}^{m} h_l^2\,\frac{\lambda_l}{1 - \lambda_l}. \tag{3.8}$$

The $h_l$ are identically distributed.

The resemblance of (3.8) to (2.5) is striking. The role of the $t_i^2$ is assumed by the $h_l^2$ and $v_{ii}/(1 - v_{ii})$ is replaced by the $\lambda_l/(1 - \lambda_l)$. In (3.8), a sum over $m$ orthogonal directions is required; in (2.5), $m = 1$.

A generalization of the squared Studentized residuals to $m$ cases is given by $r_I^T(I - V_I)^{-1}r_I/s^2 = \sum h_l^2$. Under normal theory (Gentleman and Wilk, 1975), a likelihood ratio test statistic for the hypothesis that the $m$ cases are not an outlying set can be computed as

$$F_I = \frac{(n - p - m)\sum h_l^2}{m\left(n - p - \sum h_l^2\right)}. \tag{3.9}$$

The nominal distribution of $F_I$ is $F(m, n - p - m)$.
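A small sketch of the multiple-case quantities, assuming NumPy and synthetic data: it evaluates $D_I$ through the eigen-decomposition (3.4) to (3.8), evaluates $F_I$ from (3.9), and compares $D_I$ with the definition (3.2) obtained by actually deleting the cases. The function names, the tolerance used to detect $\lambda_m = 1$, and the chosen subset are assumptions for illustration.

```python
import numpy as np

def subset_distance(X, y, idx):
    """Return (D_I, F_I) via (3.8) and (3.9) for the cases listed in idx."""
    idx = list(idx)
    n, p = X.shape
    m = len(idx)
    V = X @ np.linalg.solve(X.T @ X, X.T)          # projection matrix V
    r = y - V @ y                                  # full-data residuals
    s2 = r @ r / (n - p)
    lam, G = np.linalg.eigh(V[np.ix_(idx, idx)])   # V_I = G diag(lam) G^T, so Gamma = G^T
    if lam[-1] >= 1.0 - 1e-12:                     # lambda_m = 1: deleted data rank deficient
        return np.inf, np.nan
    g = G.T @ r[idx]                               # g = Gamma r_I
    h2 = g**2 / (s2 * (1.0 - lam))                 # h_l^2 from (3.7)
    D_I = np.sum(h2 * lam / (1.0 - lam)) / p
    F_I = (n - p - m) * h2.sum() / (m * (n - p - h2.sum()))
    return D_I, F_I

def subset_distance_by_deletion(X, y, idx):
    """D_I from the definition (3.2), refitting without the cases in idx."""
    n, p = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    r = y - X @ beta
    s2 = r @ r / (n - p)
    keep = np.setdiff1d(np.arange(n), idx)
    beta_I = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    d = beta_I - beta
    return d @ (X.T @ X) @ d / (p * s2)

# Synthetic data and an arbitrary subset (hypothetical, for illustration only).
rng = np.random.default_rng(2)
X = np.column_stack([np.ones(25), rng.normal(size=(25, 2))])
y = X @ np.array([0.5, 1.0, -2.0]) + rng.normal(size=25)
I = [3, 11, 17]

D_I, F_I = subset_distance(X, y, I)
D_I_check = subset_distance_by_deletion(X, y, I)
```

Only the $m \times m$ block $V_I$ is decomposed, so the cost per subset depends on $m$ rather than on $n$ once $V$ and the residuals are available.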
When $m = 1$, the choice of $v_{ii}$ to represent leverage is reasonable, since from Table 1, most choices of $M$, $c$ suggest a monotonic transformation of $v_{ii}$ as the fixed part of the distance measure. When $m > 1$, leverage is more difficult to define. One definition of leverage is obtained by replacing $s^2$ with $\sigma^2$ and taking expectations in (3.6) to get $\sum \lambda_l/(1 - \lambda_l) = \mathrm{tr}(V_I(I - V_I)^{-1})$. This definition, however, does depend on $M$ and $c$. For the measure $D_I(X_{(I)}^TX_{(I)}, ps^2)$, the analogous definition of leverage is $\sum \lambda_l = \mathrm{tr}(V_I)$. These two statistics are not monotonic functions of each other. While $\mathrm{tr}(V_I(I - V_I)^{-1})$ is attractive because it corresponds to our preferred metric, $\mathrm{tr}(V_I)$ is also appealing because it is easily computed from the $v_{ii}$, and exact formation of $V_I$ is not required. Based on other considerations, Draper and John (1979) suggest using $|I - V_I| = \prod(1 - \lambda_l)$ as a measure of leverage, with small values indicating potential leverage. This, too, is not a monotonic function of the other suggested leverage values, and choice for the form of a leverage statistic may depend on the purpose for which it is to be used.

Looking at the $D_I$.
One goal in examining subsets of $m > 1$ cases is to find groups of cases that, while not individually influential, are influential when taken as a group. Finding influential subsets which include cases that are individually influential may add little information because the observed influence of the subset will be due, in part, to the influence of the single influential case. Conversely, finding an uninfluential subset that includes one or more cases that are singly influential would not decrease the interest in those cases. Thus, good candidates for inclusion in subsets will have small distance values for $m = 1$, but they may well have relatively large values of $v_{ii}$ or $t_i^2$.

Alternatively, it may be desirable to consider the possibility that the individual cases in an influential subset are related (e.g. by time, order, etc.). In this situation, good candidates for inclusion in subsets will include individually influential cases.

Finally, when the influence of subsets is judged on internal scaling, lower order subsets should perhaps be ignored, as it is the relative rather than the absolute behavior of $D_I$ that is important. For additional comments on the use of internal variability, see Belsley, Kuh and Welsch (1980, p. 29).

If sufficient computer memory is available to store the residual vector and all of the elements of $V$, then an efficient algorithm based in part on the Furnival and Wilson (1974) method for subset selection can be written to find subsets of cases with large values of the outlier test statistic (3.9). However, an equivalent algorithm for finding subsets with large $D_I$ is not immediately apparent, since altering a subset by addition, deletion, or substitution of a case can result in substantial change in the eigenstructure of $V_I$, and hence in the value of $D_I$. Even so, complete storage of $V$ is usually impractical (but see Section 6), and realistic techniques for finding influential subsets should use only the residuals and the diagonal entries of $V$. Using only these, upper bounds for $D_I$ can be derived, and only if these are sufficiently large must $D_I$ be computed exactly.

For the first upper bound, since $\lambda_m/(1 - \lambda_m)^2 \ge \lambda_l/(1 - \lambda_l)^2$, $l = 1, 2, \ldots, m$, (3.6) can be approximated by

$$D_I \le \frac{\lambda_m}{(1 - \lambda_m)^2}\,\frac{1}{ps^2}\sum_{l=1}^{m} g_l^2. \tag{3.10}$$

But $\sum g_l^2 = r_I^T\Gamma^T\Gamma r_I = \sum_{i \in I} r_i^2$, and hence

$$D_I \le \frac{\lambda_m}{(1 - \lambda_m)^2}\,\frac{\sum_{i \in I} r_i^2}{ps^2}. \tag{3.11}$$

For (3.11) to be useful, $\lambda_m$ must be replaced by an approximation that can be computed without need for forming $V_I$. The easiest approximation to use is $\lambda_m \le \mathrm{tr}(V_I)$, assuming $\mathrm{tr}(V_I) < 1$. Thus,

$$D_I \le \frac{\mathrm{tr}(V_I)}{(1 - \mathrm{tr}(V_I))^2}\,\frac{\sum_{i \in I} r_i^2}{ps^2}$$

or, equivalently,

$$D_I \le \frac{\sum_{i \in I} v_{ii}}{\left(1 - \sum_{i \in I} v_{ii}\right)^2}\,\frac{\sum_{i \in I} r_i^2}{ps^2}. \tag{3.12}$$

Approximation (3.12) depends only on the one at a time statistics and provides a potentially different upper bound for each $I$. For any subset with $\mathrm{tr}(V_I) \ge 1$, a better approximation to $\lambda_m$ is required. This, in turn, requires finding $V_I$. If $m$ is small (2 or 3) exact computation of $D_I$ is probably as efficient as obtaining a better approximation for $\lambda_m$.

For a fixed $m$, let $T = \max_I\left(\sum_{i \in I} v_{ii}\right)$ and $R^2 = \max_I \sum_{i \in I} r_i^2$, where $I$ varies over all subsets of size $m$ under consideration. Two upper bounds for the right side of (3.12) are then

$$D_I \le \frac{\mathrm{tr}(V_I)}{(1 - \mathrm{tr}(V_I))^2}\,\frac{R^2}{ps^2} \tag{3.13}$$

and, if $T < 1$,

$$D_I \le \frac{T}{(1 - T)^2}\,\frac{\sum_{i \in I} r_i^2}{ps^2}. \tag{3.14}$$

These last two may be combined to give

$$D_I \le \frac{T}{(1 - T)^2}\,\frac{R^2}{ps^2}. \tag{3.15}$$

Clearly, (3.12) $\le$ (3.13) $\le$ (3.15), and (3.12) $\le$ (3.14) $\le$ (3.15). If $m = 1$, all four approximations are exact.
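The bound (3.12) can be illustrated with a short screening sketch: for every pair of cases it evaluates the bound from the $v_{ii}$ and $r_i$ alone, and the exact $D_I$ from (3.3) for comparison. This is not the authors' algorithm; the function names, the synthetic data and the cutoff value are assumptions chosen for illustration.

```python
import numpy as np
from itertools import combinations

def exact_DI(V, r, s2, p, idx):
    """Exact D_I from (3.3): a^T V_I a / (p s^2) with a = (I - V_I)^{-1} r_I."""
    idx = list(idx)
    V_I = V[np.ix_(idx, idx)]
    a = np.linalg.solve(np.eye(len(idx)) - V_I, r[idx])
    return a @ V_I @ a / (p * s2)

def bound_312(v, r, s2, p, idx):
    """Upper bound (3.12), using only the diagonal of V and the residuals."""
    idx = list(idx)
    tr = v[idx].sum()
    if tr >= 1.0:                       # bound not available; exact computation needed
        return np.inf
    return tr / (1.0 - tr) ** 2 * np.sum(r[idx] ** 2) / (p * s2)

# Synthetic data (hypothetical, for illustration only).
rng = np.random.default_rng(3)
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, -1.0, 0.5]) + rng.normal(size=n)

V = X @ np.linalg.solve(X.T @ X, X.T)
v = np.diag(V)
r = y - V @ y
s2 = r @ r / (n - p)

pairs = list(combinations(range(n), 2))            # all subsets of size m = 2
bound = np.array([bound_312(v, r, s2, p, I) for I in pairs])
exact = np.array([exact_DI(V, r, s2, p, I) for I in pairs])

cutoff = 0.5                                       # hypothetical cutoff point
needs_exact = bound > cutoff                       # only these pairs require exact D_I
```

In a real screening pass the exact values would be computed only for the pairs flagged in `needs_exact`; they are computed for all pairs here solely to confirm that the bound holds.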
An algorithm for finding all relevant subsets with fixed $m$ can be based on these approximations. First, cases that are influential in subsets of size smaller than $m$ may be eliminated. Then, the remaining $v_{ii}$ and $r_i^2$ are ordered, largest to smallest. The four inequalities can then be applied to subsets with $\mathrm{tr}(V_I) < 1$ in the order (3.15), then (3.14) or (3.13), and finally (3.12). Exact computation is required if (3.12) is larger than a selected cutoff point. By considering subsets according to the ordered lists of $v_{ii}$ and $r_i^2$, the subsets that are more likely to be influential are considered first, and, once one of the bounds is sufficiently small, no further subsets made up of cases lower in the lists need to be considered. Generally, this method will be useful in data sets with $n$ large relative to $p$, where $\mathrm{tr}(V_I)$ will usually be less than 1. In smaller data sets, relatively more subsets must be considered. For example, for $m = 2$ and $m = 3$, and for two data sets, one with $n = 21$, $p = 10$ (discussed in Section 7) and the other with $n = 125$, $p = 4$, the results of a simple algorithm are summarized in Table 2. The number of subsets is less than the total number of possible subsets because cases influential in subsets of a smaller size were not considered as $m$ was increased. While in data set 1 little is gained by use of the inequalities, in data set 2, with large $n$, substantial decrease in computation is apparent. These results could be improved by refinement of the algorithm.

TABLE 2. Computations using upper bounds.

                                            Data Set 1        Data Set 2
                                            n = 21, p = 8     n = 125, p = 4
m = 2:
  Number of subsets considered              155               7503
  Number of applications of (3.12)-(3.14)   153               651
  Number of D_I computed                    74                5
m = 3:
  Number of subsets considered              560               302,621
  Number of applications of (3.12)-(3.14)   560               74,802
  Number of D_I computed                    520               727

4. LINEAR COMBINATIONS
In this section, we extend the previous discussion to accommodate the situation in which $q$ linearly independent combinations of the elements of $\beta$ are of interest. This may be desirable when interest centers on a selected subset of $\beta$. Also, once influential cases have been found using $D_I$, it may be desirable to isolate their effects on the components of $\hat\beta$.

Let $\psi = L\beta$, where $L$ is a $q \times p$ rank $q$ matrix. The distance $D_I(\psi)$, between $\hat\psi = L\hat\beta$ and $\hat\psi_{(I)} = L\hat\beta_{(I)}$, is defined to be

$$D_I(\psi) = \frac{(\hat\psi_{(I)} - \hat\psi)^T\left[L(X^TX)^{-1}L^T\right]^{-1}(\hat\psi_{(I)} - \hat\psi)}{qs^2}. \tag{4.1}$$

This is a special case of the distance function $D_I(M, c)$ obtained by choosing $c = qs^2$ and

$$M = L^T\left[L(X^TX)^{-1}L^T\right]^{-1}L. \tag{4.2}$$

Bingham (1977) has shown that the numerator of $D_I(\psi)$ can be written as

$$qs^2D_I(\psi) = r_I^T(I - V_I)^{-1}X_I(X^TX)^{-1}M(X^TX)^{-1}X_I^T(I - V_I)^{-1}r_I \tag{4.3}$$

where $M$ is given by (4.2). Apparently, further simplification is not possible without additional constraints on $L$. However, direct computation of $D_I(\psi)$ will probably be unnecessary for most $I$: Since $(X^TX)^{-1/2}M(X^TX)^{-1/2}$ is idempotent, it follows from (4.3) that $qs^2D_I(\psi) \le ps^2D_I$ and, therefore,

$$D_I(\psi) \le \frac{p}{q}\,D_I \tag{4.4}$$
for all $I$ and $\psi$. Thus, if $D_I$ is negligible, $D_I(\psi)$ must be negligible also. The result (4.4) was mentioned by Cook (1979) for the case $m = 1$.

Subsets of $\beta$
Suppose, without loss of generality, that the last $q$ components of $\beta$ are of interest, and partition $X = (X_1, X_2)$ where $X_1$ is $n \times (p - q)$ and $X_2$ is $n \times q$. Thus $L = (0, I_q)$, and

$$(X^TX)^{-1}M(X^TX)^{-1} = (X^TX)^{-1} - \begin{pmatrix} (X_1^TX_1)^{-1} & 0 \\ 0 & 0 \end{pmatrix} \tag{4.5}$$

where $M$ is given by (4.2). Substitution in (4.3) yields

$$qs^2D_I(\psi) = ps^2D_I - r_I^T(I - V_I)^{-1}U_I(I - V_I)^{-1}r_I, \tag{4.6}$$

where $U = (u_{ij}) = X_1(X_1^TX_1)^{-1}X_1^T$. Alternatively, (4.6) can be written as

$$qs^2D_I(\psi) = r_I^T(I - V_I)^{-1}(V_I - U_I)(I - V_I)^{-1}r_I. \tag{4.7}$$

Thus, when a single case is deleted ($m = 1$)

$$D_i(\psi) = \frac{t_i^2}{q}\,\frac{v_{ii} - u_{ii}}{1 - v_{ii}}. \tag{4.8}$$

The influence of a single case on a selected subset of $\beta$ may therefore be determined from the result of two separate regressions on the full data.

5. CHANGE IN VOLUME OF CONFIDENCE ELLIPSOIDS
Thus far, the measure of influence of the cases indexed by $I$ on the regression has been the empirical influence function. Many other possibilities exist for measuring the impact of these cases on the regression. In this section, we consider measures based on changes in volume of a confidence ellipsoid. We derive a volume measure for $\beta$ when deleting a single case, and then give the more general result.

The $(1 - \alpha) \times 100\%$ normal theory confidence region for $\beta$, based on the full data, is given by all points $\beta^*$ in the ellipsoid

$$E = \left\{\beta^* : (\beta^* - \hat\beta)^T(X^TX)(\beta^* - \hat\beta) \le ps^2F(1 - \alpha; p, n - p)\right\} \tag{5.1}$$

where $F(1 - \alpha; n_1, n_2)$ is the $(1 - \alpha) \times 100\%$ point of an $F$ distribution with $(n_1, n_2)$ degrees of freedom. This ellipsoid, after the $i$th case is deleted, is

$$E_{(i)} = \left\{\beta^* : (\beta^* - \hat\beta_{(i)})^T(X_{(i)}^TX_{(i)})(\beta^* - \hat\beta_{(i)}) \le ps_{(i)}^2F(1 - \alpha; p, n - p - 1)\right\}. \tag{5.2}$$

As is well known, the volume of an ellipsoid is proportional to the inverse of the square root of the determinant of the appropriate cross product matrix, and hence the logarithm of the ratio of the volume of (5.1) to (5.2) is

$$\log\frac{\mathrm{Vol}(E)}{\mathrm{Vol}(E_{(i)})} = \log\left[\frac{|X_{(i)}^TX_{(i)}|^{1/2}}{|X^TX|^{1/2}}\left(\frac{s^2F(1 - \alpha; p, n - p)}{s_{(i)}^2F(1 - \alpha; p, n - p - 1)}\right)^{p/2}\right] = \frac{1}{2}\log(1 - v_{ii}) + \frac{p}{2}\log\left[\frac{n - p - 1}{n - p - t_i^2}\cdot\frac{F(1 - \alpha; p, n - p)}{F(1 - \alpha; p, n - p - 1)}\right]. \tag{5.3}$$

Apart from the ratio of $F$ values, (5.3) is equivalent to the statistic COVRATIO given in Belsley, Kuh and Welsch (1980). This latter statistic, however, is motivated directly as a ratio of determinants, not as a ratio of volumes of confidence ellipsoids.

If this quantity is large and positive, then deletion of the $i$th case will result in a substantial decrease in volume. Deletion of an outlier, for example, may give this result. On the other hand, if (5.3) is large and negative, deletion of the case will result in a substantial increase in volume. This occurs if $v_{ii}$ is near 1. For most cases, (5.3) should be near zero.

More generally, the relative volumes for estimating $\psi$, a subset of $q$ components of the $p$ vector, after deleting the $m$ cases indexed by $I$, is found to be

$$\log\frac{\mathrm{Vol}(E_\psi)}{\mathrm{Vol}(E_{\psi(I)})} = \frac{1}{2}\log\frac{|I - V_I|}{|I - U_I|} + \frac{q}{2}\log\left[\frac{n - p - m}{n - p - \sum h_l^2}\cdot\frac{F(1 - \alpha; q, n - p)}{F(1 - \alpha; q, n - p - m)}\right] \tag{5.4}$$

where $U$ is defined following (4.6), and $h_l$ is defined by (3.7).

In particular, since the intercept is not generally of interest, the ratio, with $m = 1$ and $q = p - 1$ ($\psi = (\beta_1, \ldots, \beta_{p-1})^T$), is

$$\log\frac{\mathrm{Vol}(E_\psi)}{\mathrm{Vol}(E_{\psi(i)})} = \frac{1}{2}\log\frac{1 - v_{ii}}{1 - (1/n)} + \frac{p - 1}{2}\log\left[\frac{n - p - 1}{n - p - t_i^2}\cdot\frac{F(1 - \alpha; p - 1, n - p)}{F(1 - \alpha; p - 1, n - p - 1)}\right]. \tag{5.5}$$

This last form is recommended for general use.

An alternative volume measure was suggested by Andrews and Pregibon (1978). They define $X^* = (X\ Y)$, and suggest looking at the ratio

$$R_I(X^*) = \frac{|X_{(I)}^{*T}X_{(I)}^*|}{|X^{*T}X^*|} \tag{5.6}$$
which corresponds to the proportion in total volume generated by the data that is not due to the cases indexed by $I$. "Distant" or unusual sets of cases will tend to have $R_I(X^*)$ small while cases in the middle of the data will account for little volume. Unlike the earlier volume measures discussed in this section, $R_I(X^*)$ is invariant with respect to specification of the response variable from among the columns of $X^*$, and refers to a $p + 1$ dimensional geometry, while the earlier measures are generated with specific reference to $Y$ and to the $p$ dimensional geometry of the column space of $X$. Draper and John (1979) provide a comparison of $R_I(X^*)$ and $D_I$.
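The single-case volume measure (5.3) can be sketched numerically. To keep the example free of $F$ quantile calls, the code below evaluates only the part of (5.3) that does not involve the ratio of $F$ values, namely $\frac{1}{2}\log(1 - v_{ii}) + \frac{p}{2}\log\{(n - p - 1)/(n - p - t_i^2)\}$, and checks it against the determinants and variance estimates obtained by actually deleting each case. Function names and data are assumptions for illustration.

```python
import numpy as np

def log_volume_ratio_core(X, y):
    """(5.3) without the ratio of F values, from full-data quantities only."""
    n, p = X.shape
    V = X @ np.linalg.solve(X.T @ X, X.T)
    v = np.diag(V)
    r = y - V @ y
    s2 = r @ r / (n - p)
    t2 = r**2 / (s2 * (1.0 - v))                   # squared Studentized residuals
    return 0.5 * np.log(1.0 - v) + 0.5 * p * np.log((n - p - 1) / (n - p - t2))

def log_volume_ratio_by_deletion(X, y):
    """Same quantity from determinants and s^2 computed with case i deleted."""
    n, p = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    s2 = np.sum((y - X @ beta) ** 2) / (n - p)
    logdet_full = np.linalg.slogdet(X.T @ X)[1]
    out = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        Xi, yi = X[keep], y[keep]
        bi = np.linalg.lstsq(Xi, yi, rcond=None)[0]
        s2_i = np.sum((yi - Xi @ bi) ** 2) / (n - p - 1)
        logdet_i = np.linalg.slogdet(Xi.T @ Xi)[1]
        out[i] = 0.5 * (logdet_i - logdet_full) + 0.5 * p * np.log(s2 / s2_i)
    return out

# Synthetic data (hypothetical, for illustration only).
rng = np.random.default_rng(4)
X = np.column_stack([np.ones(30), rng.normal(size=(30, 3))])
y = X @ np.array([1.0, 0.5, -0.5, 2.0]) + rng.normal(size=30)

core_fast = log_volume_ratio_core(X, y)
core_slow = log_volume_ratio_by_deletion(X, y)
```

Adding $\frac{p}{2}\log\{F(1 - \alpha; p, n - p)/F(1 - \alpha; p, n - p - 1)\}$, a constant that does not depend on $i$, would give the full expression (5.3).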
6. COMPUTATIONAL CONSIDERATIONS
Depending on available computer storage, either the Cholesky factorization of $X^TX$ or a QR factorization of $X$ can be used to obtain submatrices of $V$. With minimal storage available, we can find a $p \times p$ rank $p$ upper triangular matrix $R$ such that $X^TX = R^TR$. Then since $v_{ij} = x_i^T(X^TX)^{-1}x_j$, we have

$$v_{ij} = (R^{-T}x_i)^T(R^{-T}x_j). \tag{6.1}$$

Explicit inversion of $R$ can be avoided by solving the system $x_i = R^Tc_i$ for $c_i$ by back substitution. Then, $v_{ij}$ can be computed as an inner product, $v_{ij} = c_i^Tc_j$.
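A minimal sketch of the Cholesky route, assuming NumPy: `np.linalg.cholesky` returns the lower triangular factor $L$ with $X^TX = LL^T$, so $R = L^T$ and the system $x_i = R^Tc_i$ is the lower triangular system $Lc_i = x_i$, solved here by substitution from the first row down. The helper name `forward_substitute`, the data and the chosen index set are assumptions for illustration.

```python
import numpy as np

def forward_substitute(L, B):
    """Solve L C = B for C, with L lower triangular (here L = R^T)."""
    C = np.zeros_like(B, dtype=float)
    for k in range(L.shape[0]):
        # Row k uses the already computed rows 0..k-1 of C.
        C[k] = (B[k] - L[k, :k] @ C[:k]) / L[k, k]
    return C

# Synthetic data (hypothetical, for illustration only).
rng = np.random.default_rng(6)
X = np.column_stack([np.ones(15), rng.normal(size=(15, 2))])

L = np.linalg.cholesky(X.T @ X)      # X^T X = L L^T, so R = L^T is upper triangular
C = forward_substitute(L, X.T)       # column i of C is c_i, the solution of x_i = R^T c_i

v = np.sum(C**2, axis=0)             # v_ii = c_i^T c_i
idx = [0, 3, 7]                      # an arbitrary subset I
V_I = C[:, idx].T @ C[:, idx]        # submatrix V_I from inner products c_i^T c_j
```

Only the $p \times p$ factor has to be kept in memory; each $c_i$ can be recomputed from the corresponding row of $X$ when a submatrix $V_I$ is needed.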
With more storage available, we can find in addition to $R$ an $n \times p$ matrix $Q_1$ with orthonormal columns such that $X = Q_1R$. It follows immediately that $V = Q_1Q_1^T$, so $v_{ij}$ can be found as the inner product of the $i$th and $j$th rows of $Q_1$. Thus, $Q_1$, requiring $np$ storage locations, contains all the information in $V$, which requires $n(n + 1)/2$ storage locations. In addition, computation of $u_{ii}$ in (4.8) is easily done if the columns of $X_1$ (as in Section 4) are the first $p - q$ columns of $X$. Then $u_{ii}$ is just the squared norm of the first $p - q$ columns of the $i$th row of $Q_1$.
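The QR route can be sketched as follows, assuming NumPy's reduced QR factorization and synthetic data. The code reads $v_{ii}$ and $u_{ii}$ off the rows of $Q_1$, evaluates $D_i(\psi)$ from (4.8) for the last $q$ coefficients, and compares it with the definition (4.1) computed by deleting each case. All variable names are illustrative assumptions.

```python
import numpy as np

# Synthetic data (hypothetical, for illustration only).
rng = np.random.default_rng(5)
n, p, q = 25, 4, 2                        # the last q coefficients are of interest
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = rng.normal(size=n)

Q1, R = np.linalg.qr(X)                   # reduced QR: X = Q1 R, Q1 is n x p
v = np.sum(Q1**2, axis=1)                 # v_ii: squared norm of row i of Q1
u = np.sum(Q1[:, :p - q]**2, axis=1)      # u_ii: first p - q columns of row i only

r = y - Q1 @ (Q1.T @ y)                   # residuals
s2 = r @ r / (n - p)
t2 = r**2 / (s2 * (1.0 - v))              # squared Studentized residuals
D_psi = t2 * (v - u) / (q * (1.0 - v))    # reduced form (4.8)
D_all = t2 * v / (p * (1.0 - v))          # D_i from (2.5)

# Direct evaluation of (4.1) with L = (0, I_q), deleting one case at a time.
beta = np.linalg.solve(R, Q1.T @ y)
C = np.linalg.inv(X.T @ X)[p - q:, p - q:]        # L (X^T X)^{-1} L^T
D_direct = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    beta_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    d = (beta_i - beta)[p - q:]
    D_direct[i] = d @ np.linalg.solve(C, d) / (q * s2)
```

Because the leading $p - q$ columns of $Q_1$ span the column space of $X_1$, no second regression is needed to obtain $u_{ii}$; the inequality (4.4) can also be checked directly by comparing `D_psi` with `(p / q) * D_all`.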
Complete discussions of the Cholesky factorization (to obtain $R$) and the QR factorization (to obtain $R$ and $Q_1$) are given by Stewart (1974). In addition, the LINPACK library (Dongarra et al., 1979) includes well-documented subroutines for them.
All of the statistics studied in this paper are functions of the residuals, the $v_{ij}$ (and perhaps also $u_{ij}$) and usual regression summaries such as the residual mean square. For example, $D_I$ can be computed by first finding $a = (I - V_I)^{-1}r_I$. Then, $D_I$ is computed as a quadratic form, $D_I = a^TV_Ia/ps^2$.
7. THE 1975 FLORIDA AREA CUMULUS EXPERIMENT (FACE)
Judging the success of cloud seeding experiments intended to increase rainfall is an important statistical problem. Results from past experiments are mixed. It is generally recognized that, depending on various environmental contributing factors, seeding can produce an increase or decrease in rainfall, or have no effect. Moreover, the critical factors controlling the response are, for the most part, unknown. This fundamental treatment-unit nonadditivity makes judgments about the effects of seeding difficult (Cook and Holshuh, 1979).
The 1975 Florida Area Cumulus Experiment (FACE) was conducted to determine the merits of using silver iodide to increase rainfall and to isolate some of the factors contributing to the treatment-unit nonadditivity (Woodley et al., 1977; see also Bradley, Srivastava and Lanzdorf, 1979). The target consisted of an area of about 3,000 square miles to the north and east of Coral Gables, Florida. In this experiment, 24 days in the summer of 1975 were judged suitable for seeding based on a daily suitability criterion of $S - Ne > 1.5$, where $S$ (seedability) is the predicted difference between the maximum heights of a cloud if seeded and the same cloud if not seeded, and $Ne$ is a factor which increases with conditions leading to naturally rainy days. (For a more detailed description see Woodley et al., 1977.) Generally, suitable days were those on which the seedability was large and the natural rainfall early in the day was small. On each suitable day, the decision to seed was based on unrestricted randomization; as it happened, 12 days were seeded and 12 were unseeded.
The following variables were measured on each suitable day:
Echo Coverage (C):
Percent cloud cover in the experimental area, measured using radar in Coral Gables, Florida
Prewetness (P):
Total rainfall in the target area one hour before seeding (in cubic meters × 10^7)
Echo Motion (E):
A classification indicating a moving radar echo (1) or a stationary radar echo (2)
Response Variable (Y):
The amount of rain that fell in the target area for a six-hour period on each suitable day (in cubic meters × 10^7).
The data as presented by Woodley et al. (1977) are reproduced in Table 3. We have also included the variable
Time Trend (T): Number of days after the first day of the experiment (June 16, 1975 = 0).
This variable is potentially relevant because there may be a time trend in natural rainfall or modification in the experimental techniques.
In addition to selecting days based on suitability ($S - Ne$), the investigators attempted to use only days with $C < 13$ percent. A disturbed day was defined as $C > 13$. From Table 3, the first two experimental days are disturbed with the second day being highly disturbed ($C = 37.9$ percent). Thus it can be anticipated that case 2 may be influential, and the process under study may differ under the conditions of case 2. Therefore, case 2 will be deleted from the primary analysis. The effects of including case 2 will be presented later.

Initially, we shall adopt the model

$$LY = \beta_0 + \beta_1 A + \beta_2 T + \beta_3(S - Ne) + \beta_4 C + \beta_5 LP + \beta_6 E + \beta_{13}(A \times (S - Ne)) + \beta_{14}(A \times C) + \beta_{15}(A \times LP) + \beta_{16}(A \times E) \tag{7.1}$$

where $LY = \log_{10} Y$ and $LP = \log_{10} P$. This model contains all linear terms and all cross products between action ($A = 1$ for seeded days, $A = 0$ for unseeded days) and the base variables. The cross product terms are to model the possibility of treatment-unit nonadditivity. Because of the limited degrees of freedom available, higher order terms in the base variables have not been included.

The main goal of our analysis is to describe the difference, $\Delta LY$, between predicted rainfall for seeded days and unseeded days,

$$\Delta LY = LY(A = 1) - LY(A = 0) = \beta_1 + \beta_{13}(S - Ne) + \beta_{14} C + \beta_{15} LP + \beta_{16} E. \tag{7.2}$$

Thus, the subset $\psi^T = (\beta_1, \beta_{13}, \beta_{14}, \beta_{15}, \beta_{16})$ is of primary interest.

Table 4 gives the least square estimates and their estimated standard errors for the coefficients in (7.1)

TABLE 3. Data from Florida Area Cumulus Experiment, 1975 (source: Woodley et al., 1977).

CASE   A    T     S      C       P      E    SA      Y
  1    0    0    1.75   13.4    .274    2    0      12.85
  2    1    1    2.70   37.9   1.267    1    2.70    5.52
  3    1    3    4.10    3.9    .198    2    4.10    6.29
  4    0    4    2.35    5.3    .526    1    0       6.11
  5    1    6    4.25    7.1    .250    1    4.25    2.45
  6    0    9    1.60    6.9    .018    2    0       3.61
  7    0   18    1.30    4.6    .307    1    0        .47
  8    0   25    3.35    4.9    .194    1    0       4.56
  9    0   27    2.85   12.1    .751    1    0       6.35
 10    1   28    2.20    5.2    .084    1    2.20    5.06
 11    1   29    4.40    4.1    .236    1    4.40    2.76
 12    1   32    3.10    2.8    .214    1    3.10    4.05
 13    0   33    3.95    6.8    .796    1    0       5.74
 14    1   35    2.90    3.0    .124    1    2.90    4.84
 15    1   38    2.05    7.0    .144    1    2.05   11.86
 16    0   39    4.00   11.3    .398    1    0       4.45
 17    0   53    3.35    4.2    .237    2    0       3.66
 18    1   55    3.70    3.3    .960    1    3.70    4.22
 19    0   56    3.80    2.2    .230    1    0       1.16
 20    1   59    3.40    6.5    .142    2    3.40    5.45
 21    1   65    3.15    3.1    .073    1    3.15    2.02
 22    0   68    3.15    2.6    .136    1    0        .82
 23    1   82    4.01    8.3    .123    1    4.01    1.09
 24    0   83    4.65    7.4    .168    1    0        .28

CA  PA  EA
0
0
0
37.9
3.9
1.267
.198
1
0
0
0
1
0
7.1
.250
0
0
0
0
0
0
0
0
5.2
4.1
2.8
0
.084
.236
.214
0
3.0
7.0
.124
.144
0
0
0
0
3.3
0
6.5
3.1
0
8.3
0
.960
0
.142
.073
0
.123
0
2
0
0
0
1
1
1
0
1
1
0
0
1
0
2
1
0
1
0
as well as the estimatefor a few selected subset
models to be discussedlater.
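The structure of (7.1) and (7.2) can be sketched in a few lines of code. This is only an illustration under stated assumptions, not the authors' computation: the names `TERMS`, `design_row`, `predict_ly` and `predicted_difference` are hypothetical, and the coefficients are the full-model estimates read from the first column of Table 4.

```python
import math

# Term order of model (7.1); the keys are illustrative labels only.
TERMS = ["b0", "b1", "b2", "b3", "b4", "b5", "b6", "b13", "b14", "b15", "b16"]


def design_row(a, t, sne, c, p, e):
    """Return the 11 carriers of model (7.1) for one day.

    a: action (1 = seeded, 0 = unseeded); t: time trend; sne: S - Ne;
    c: echo coverage; p: prewetness; e: echo motion (1 or 2).
    """
    lp = math.log10(p)  # LP = log10 P
    return [1.0, a, t, sne, c, lp, e, a * sne, a * c, a * lp, a * e]


def predict_ly(beta, a, t, sne, c, p, e):
    """Fitted LY = log10 Y under (7.1) for a coefficient dict keyed by TERMS."""
    row = design_row(a, t, sne, c, p, e)
    return sum(beta[name] * x for name, x in zip(TERMS, row))


def predicted_difference(beta, sne, c, p, e):
    """Delta LY of (7.2): only b1 and the four interaction terms remain."""
    lp = math.log10(p)
    return (beta["b1"] + beta["b13"] * sne + beta["b14"] * c
            + beta["b15"] * lp + beta["b16"] * e)


# Full-model estimates with case 2 deleted (first column of Table 4).
beta_full = dict(zip(TERMS, [-0.291, 2.244, -0.009, 0.136, 0.025, 0.436,
                             0.573, -0.465, -0.011, -0.049, -0.291]))

# Conditions of case 5 in Table 3: T = 6, S - Ne = 4.25, C = 7.1, P = 0.250, E = 1.
seeded = predict_ly(beta_full, 1, 6, 4.25, 7.1, 0.250, 1)
unseeded = predict_ly(beta_full, 0, 6, 4.25, 7.1, 0.250, 1)
delta = predicted_difference(beta_full, 4.25, 7.1, 0.250, 1)
```

Here `seeded - unseeded` agrees with `delta` up to floating-point rounding, which is the content of (7.2): the terms that do not involve A cancel in the difference.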
Case Analysis: Full Model
Table 5 gives $r_i$, $t_i$, $v_{ii}$, $F_i^{1/2}$ (see 3.9), $D_i$ and
$D_i(\psi)$ for the full model without case 2. The largest two values of each
statistic except $v_{ii}$ listed in Table 5 correspond to cases 7 and 24, both
unseeded days with unusually low rainfall recorded. The values of $F_i^{1/2}$
for i = 7 and 24 and the Studentized residual plot (Andrews and Pregibon,
1978) given in Figure 2 suggest that these cases do not conform to the assumed
model. Using the Bonferroni inequality with equal allocation of probability or
the allocation method of Section 2, the p-value for case 7 is near 0.05. The
most likely candidate for a pair of outliers is (7, 24) and the associated
F-statistic can be computed to be 21.21 on 2 and 10 degrees of freedom. The
Bonferroni p-value using either method is near 0.06. Although (7, 24) is
evidently an outlier pair, it has little influence on the least squares
estimate of $\beta$ or $\psi$, $D_{(7,24)} = 0.455$, and removal of (7, 24)
will move $\hat{\beta}$ only to the edge of a 10% confidence ellipsoid.
As an alternative to deleting cases 7 and 24 (that is, modeling them with
indicator vectors), we could consider attempting to expand the model to
include additional terms in the base carriers. For example, if a variable
$(S - Ne)^2$ is added to the model, then the residuals for both cases 7 and 24
become relatively small. However, the influence measure for these two cases on
the parameter estimate for $\psi^* = \{(S - Ne)^2\}$ is
$D_{(7,24)}(\psi^*) = 19.71$, which suggests that including this variable has
little effect other than providing an alternative model for cases 7 and 24.
While in this problem we prefer to delete these cases, in other problems
adding another variable may be preferable.
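The Bonferroni p-value quoted for the pair (7, 24) can be checked by hand. The sketch below assumes that equal allocation means multiplying the nominal p-value by the number of candidate pairs among the 23 retained cases; the function name `upper_tail_f2` is hypothetical. It relies on the closed-form upper tail of an F distribution with 2 numerator degrees of freedom, so no statistical library is needed.

```python
from math import comb


def upper_tail_f2(f_value, df_den):
    """P(F > f_value) for an F distribution with 2 and df_den degrees of freedom.

    With 2 numerator degrees of freedom the tail has the closed form
    (1 + 2 f / m) ** (-m / 2); other numerator df need a numerical routine.
    """
    return (1.0 + 2.0 * f_value / df_den) ** (-df_den / 2.0)


n_cases = 23                        # 24 days minus case 2
n_pairs = comb(n_cases, 2)          # number of candidate outlier pairs
p_nominal = upper_tail_f2(21.21, 10)
p_bonferroni = min(1.0, n_pairs * p_nominal)
```

Under these assumptions `p_bonferroni` comes out at roughly 0.064, in line with the value "near 0.06" reported in the text.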
TABLE 4. Estimated coefficients, standard errors and root mean squared error
(RMSE) for selected data sets and models. Each entry is the estimated
coefficient with its estimated standard error in parentheses; * marks a term
omitted from computations.

                                                   Cases deleted
Coefficient                      (2)             (2,7,24)         (2,7,24)         (2,7,24)         (7,24)
$\beta_0$                   -0.291 (0.498)    0.417 (0.400)    0.492 (0.129)    0.436 (0.145)    0.400 (0.142)
$\beta_1$ (A)                2.244 (0.819)    1.426 (0.510)    1.294 (0.190)    1.381 (0.216)    1.458 (0.206)
$\beta_2$ (T)               -0.009 (0.003)   -0.006 (0.002)   -0.007 (0.001)   -0.006 (0.001)   -0.006 (0.001)
$\beta_3$ (S - Ne)           0.136 (0.114)    0.006 (0.085)         *                *                *
$\beta_4$ (C)                0.025 (0.028)    0.030 (0.015)    0.022 (0.010)    0.028 (0.012)    0.030 (0.012)
$\beta_5$ (LP)               0.436 (0.266)    0.341 (0.146)    0.399 (0.083)    0.379 (0.087)    0.357 (0.085)
$\beta_6$ (E)                0.573 (0.261)    0.265 (0.135)    0.301 (0.074)    0.295 (0.074)    0.293 (0.075)
$\beta_{13}$ (A x (S - Ne)) -0.465 (0.178)   -0.333 (0.107)   -0.326 (0.052)   -0.319 (0.053)   -0.309 (0.053)
$\beta_{14}$ (A x C)        -0.011 (0.057)   -0.023 (0.028)         *          -0.021 (0.024)   -0.045 (0.012)
$\beta_{15}$ (A x LP)       -0.049 (0.443)    0.073 (0.224)         *                *                *
$\beta_{16}$ (A x E)        -0.291 (0.354)    0.050 (0.178)         *                *                *
RMSE ($\hat{\sigma}$)        0.291            0.139            0.122            0.123            0.124
FIGURE 2. Studentized residual plot, case 2 deleted, full model.
[Figure not reproduced: $t_i$ plotted against $\hat{y}_i$; the point for case 24 is labeled.]
The most influential pair of cases is (3, 20), $D_{(3,20)} = \infty$; when
these cases are removed the model becomes rank deficient and a unique least
squares estimate of $\beta$ does not exist. The deficiency arises because A
and E x A are identical after cases 3 and 20 are removed. Although these cases
are jointly influential, they are individually uninfluential and there is no
apparent reason to doubt their authenticity. No other pair has a serious
influence on $\hat{\beta}$ or $\hat{\psi}$.
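The source of the rank deficiency can be verified directly from the A and E columns of Table 3. The following sketch uses a hypothetical helper, `columns_without`, to compare the A and E x A carriers before and after cases 3 and 20 are removed; it checks only the collinearity of these two columns, not the rank of the full design matrix.

```python
# Action (A) and echo motion (E) for cases 1..24, copied from Table 3.
A = [0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0]
E = [2, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1]


def columns_without(deleted):
    """Return the A and E x A columns with the listed case numbers removed."""
    keep = [i for i in range(24) if (i + 1) not in deleted]
    a_col = [A[i] for i in keep]
    ea_col = [E[i] * A[i] for i in keep]
    return a_col, ea_col


a_col, ea_col = columns_without({2})          # primary analysis
a_red, ea_red = columns_without({2, 3, 20})   # pair (3, 20) also removed
collinear_before = a_col == ea_col
collinear_after = a_red == ea_red
```

Cases 3 and 20 are the only seeded days with E = 2, so once they are gone every seeded day has E = 1 and the two columns coincide.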
At this point we choose to delete the pair of suspected outliers, (7, 24). The
second column of Table 4 gives the estimated coefficients for the full model
without (2, 7, 24). The univariate case statistics for the full model without
(2, 7, 24) are summarized in Figures 3, 4 and 5. The residual plot is
well-behaved and the univariate case statistics reveal no problems or
anomalies. Inspection of the pairwise case statistics reveals no joint
outliers and, of course, (3, 20) is still the most influential pair. However,
in addition, there is a second pair which is highly influential,
$D_{(4,16)} = 7.1$ and $D_{(4,16)}(\psi) = 9.0$. Individually these cases are
uninfluential. The high joint influence appears to be the result of a large
residual correlation, +0.89. Since the residual correlation is positive, cases
4 and 16 probably lie on opposite sides of the center of the data, nearly on
the same ray (Cook, 1979). While removing these cases may lead to different
conclusions, no real need or justification for doubting the usefulness of
these two cases is apparent. Our strategy is to leave them in for further
analyses, and continue to monitor their influence.

TABLE 5. Univariate case statistics for the full model with case 2 deleted.
Here $\psi = (\beta_1, \beta_{13}, \beta_{14}, \beta_{15}, \beta_{16})$.

CASE(i)   $r_i$    $t_i$   $v_{ii}$  $F_i^{1/2}$  $D_i$  $D_i(\psi)$
   1      -.072    -.383    .580       .369       .018     .008
   3      -.124    -.714    .646       .699       .085     .052
   4       .211     .872    .307       .863       .031     .018
   5      -.258   -1.366    .578      1.423       .232     .206
   6       .158    1.190    .793      1.213       .492     .357
   7      -.510   -2.643    .560      3.916       .810     .737
   8       .343    1.322    .208      1.370       .042     .050
   9       .138     .604    .386       .587       .021     .009
  10      -.204    -.889    .379       .880       .044     .052
  11       .110     .468    .354       .452       .011     .010
  12      -.090    -.352    .234       .338       .003     .004
  13       .121     .492    .286       .475       .009     .008
  14       .039     .154    .256       .148       .001     .001
  15       .094     .497    .583       .481       .031     .050
  16       .079     .339    .365       .326       .006     .004
  17      -.086    -.622    .777       .606       .123     .144
  18       .081     .713    .848       .698       .257     .363
  19       .010     .042    .261       .040       .000     .000
  20       .124     .714    .646       .699       .085     .099
  21       .109     .543    .529       .527       .030     .043
  22       .149     .614    .301       .597       .015     .014
  23       .121     .720    .669       .704       .095     .116
  24      -.541   -2.519    .455      3.512       .482     .380

FIGURE 3. Studentized residual plot, cases 2, 7, 24 deleted, full model.
[Figure not reproduced: $t_i$ plotted against $\hat{y}_i$.]
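The columns of Table 5 are linked by simple identities, which makes the table easy to spot-check. The sketch below uses the textbook forms of the Studentized residual, the single-outlier test statistic and Cook's distance; that these coincide with the definitions used earlier in the paper (for example (3.9)) is an assumption, although they do reproduce the tabulated values for cases 7 and 24 to within rounding. The function name `case_statistics` is hypothetical.

```python
import math


def case_statistics(r, v, sigma_hat, n, p):
    """Single-case diagnostics from an ordinary residual r and leverage v.

    Returns (t, f_root, d): the Studentized residual, the square root of
    the single-outlier F statistic, and Cook's distance.
    """
    t = r / (sigma_hat * math.sqrt(1.0 - v))
    f_root = abs(t) * math.sqrt((n - p - 1) / (n - p - t * t))
    d = t * t * v / (p * (1.0 - v))
    return t, f_root, d


# Full model with case 2 deleted: n = 23 cases, p = 11 coefficients,
# RMSE = 0.291 (Table 4). Residuals and leverages are taken from Table 5.
t7, f7, d7 = case_statistics(r=-0.510, v=0.560, sigma_hat=0.291, n=23, p=11)
t24, f24, d24 = case_statistics(r=-0.541, v=0.455, sigma_hat=0.291, n=23, p=11)
```

Because the inputs are rounded to three decimals, agreement with Table 5 should be expected only to about two decimal places.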
Final Model (Cases 2, 7 and 24 Deleted)
The final model was selected by calculating all possible subset regressions
(Furnival and Wilson, 1974) and examining the few with smallest Mallows'
$C_p$. Two final models were chosen. The third column of Table 4 gives the
estimated coefficients for the model we consider "best" and the fourth column
gives those for a possible second choice. The two models differ by the
presence of the A x C ($\beta_{14}$) term. The estimates of the coefficients
in common to the two models are quite close. Also, $|\hat{\beta}_{14}|$ is
less than its standard error, so the A x C term might be judged unnecessary.
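As a rough illustration of the $C_p$ comparison, the statistic can be approximated from the RMSE values in Table 4. The sketch assumes that RMSE squared equals the residual sum of squares divided by n - p, that the full 11-term model supplies the variance estimate, and that the "best" and second-choice models have 7 and 8 terms respectively (counted from the non-omitted entries of Table 4). The function names are hypothetical, and the results are approximations from rounded inputs, not values reported by the authors.

```python
def mallows_cp(rss_subset, sigma2_full, n, p_subset):
    """Mallows' Cp = RSS_p / sigma^2 - n + 2p for a subset model with p terms."""
    return rss_subset / sigma2_full - n + 2 * p_subset


def rss_from_rmse(rmse, n, p):
    """Recover the residual sum of squares, assuming RMSE^2 = RSS / (n - p)."""
    return rmse ** 2 * (n - p)


n = 21                      # 24 days minus cases 2, 7 and 24
sigma2_full = 0.139 ** 2    # full model, 11 terms (Table 4, second column)

cp_full = mallows_cp(rss_from_rmse(0.139, n, 11), sigma2_full, n, 11)
cp_best = mallows_cp(rss_from_rmse(0.122, n, 7), sigma2_full, n, 7)
cp_second = mallows_cp(rss_from_rmse(0.123, n, 8), sigma2_full, n, 8)
```

By construction the full model has $C_p$ equal to its number of terms; both subset models fall well below their own term counts, with the 7-term model lowest.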
The univariate case statistics and residual plots for the "best" model all
appear well behaved. The bivariate case statistics were also inspected and
again no problems were noted. In particular, (3, 20) and (4, 16) are no longer
influential. As a check on the influence of (3, 20) and (4, 16), the best 5
models using the $C_p$ criterion were computed without various combinations of
these points; our final model was always among the best 5. Evidently, these
pairs of cases have little influence on the terms in the final solution.
FIGURE 4. Index plot of $v_{ii}$, full model, cases 2, 7, 24 deleted.
FIGURE 5. Index plot of $D_i$, full model, cases 2, 7, 24 deleted.
In the final model, the estimated predicted difference $\Delta LY$ (see 7.2)
contains the seeding effect and only one of the four possible interaction
terms,

$$\Delta LY = 1.29 - 0.33 (S - Ne). \tag{7.3}$$

The coefficient of the seeding suitability criterion, S - Ne, is negative and
the predicted difference in rainfall decreases as S - Ne increases. According
to this result, seeding produces a decrease in rainfall when S - Ne > 3.91. In
short, contrary to the experimenters' prior opinion, there is evidence to
suggest that optimal seeding occurs if the seeding suitability criterion is
low!
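The threshold of 3.91 follows from setting (7.3) to zero. The short sketch below recomputes it and, as an added illustration that is not part of the original analysis, back-transforms the predicted difference to a ratio of rainfall on the original scale. The function names and the two evaluation points (1.5 and 4.5) are arbitrary choices, and the back-transformation ignores any retransformation bias.

```python
def delta_ly_final(sne):
    """Predicted difference (7.3) in log10 rainfall, seeded minus unseeded."""
    return 1.29 - 0.33 * sne


def rainfall_ratio(delta_ly):
    """Ratio of predicted seeded to unseeded rainfall on the original scale."""
    return 10.0 ** delta_ly


break_even = 1.29 / 0.33                           # S - Ne where (7.3) is zero
ratio_low = rainfall_ratio(delta_ly_final(1.5))    # at the suitability cutoff
ratio_high = rainfall_ratio(delta_ly_final(4.5))   # high seedability
```

A ratio above 1 means more predicted rain with seeding; the ratio falls below 1 once S - Ne exceeds the break-even value.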
Case 2
Recall that case 2 was deleted at the outset on the grounds that it would
probably be influential and that it may not conform to the process under
study. The F-statistic for testing the fit of case 2 to the final model of
Section 7.3 has the value 16.00 with 1 and 16 degrees of freedom and the
associated p-value is less than 0.001. However, it is possible that case 2 can
be explained by one of the deleted terms. To check on this possibility, the
previous analysis was repeated with case 2 included.
All qualitative conclusions reached in the analysis without case 2 remain
valid with case 2. Also, as expected, case 2 is highly influential. For
example, in the full model after deleting the outlier pair (7, 24) the
distance measure for case 2 is $D_2 = 3.25$.
The primary difference between the two analyses is in the final models. The
last column of Table 4 gives the estimated coefficients for the model judged
best using $C_p$ when case 2 is present. A comparison of the last three
columns of Table 4 suggests that case 2 is influential for only the A x C
term. This is confirmed by the distance measures for the subsets
$(\beta_{14})$ and $\psi = (\beta_0, \beta_1, \beta_2, \beta_4, \beta_6, \beta_{13})$
from the final model, $D_2(\beta_{14}) = 4.15$ and $D_2(\psi) = 0.57$. The
A x C term is needed to model case 2 only. The predicted difference from the
final model with case 2 is

$$\Delta LY = 1.46 - 0.31 (S - Ne) - 0.045 C. \tag{7.4}$$

This suggests, in addition to previous conclusions, that the effect of seeding
decreases with increasing cloud cover. Admittedly, this conclusion seems odd.
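To see how much of (7.4) is driven by case 2, the two fitted differences can be evaluated side by side. This is an illustrative sketch only: the function names are hypothetical and the echo coverage of 5 percent is an arbitrary value chosen to represent an undisturbed day.

```python
def delta_ly_without_case2(sne):
    """Equation (7.3): final model fitted with cases 2, 7 and 24 deleted."""
    return 1.29 - 0.33 * sne


def delta_ly_with_case2(sne, c):
    """Equation (7.4): final model fitted with case 2 retained."""
    return 1.46 - 0.31 * sne - 0.045 * c


# Case 2 itself (S - Ne = 2.70, C = 37.9) and a day with the same
# suitability but an echo coverage of 5 percent (an illustrative value).
at_case2 = delta_ly_with_case2(2.70, 37.9)
at_low_cover = delta_ly_with_case2(2.70, 5.0)
reference = delta_ly_without_case2(2.70)
```

At low cloud cover the two equations give nearly the same predicted difference, while at the conditions of case 2 the A x C term pulls (7.4) strongly negative. This is consistent with the statement that the term is needed to model case 2 only.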
Finally, the full model with case 2 was fit using Huber's proposal 2 (1973)
robust estimator with a variety of truncation points. The scale was chosen as
median$[|r_i|/0.6745]$ and Bickel's proposal 2 (1975) was used as the stepping
method with Andrews' median estimate (1974) as a starting value. After 15
iterations with truncation point 1.0, cases 7, 8, 15 and 24 are given weights
less than 1 (when the robust method of fitting is viewed as an iteratively
weighted least squares method). These four cases have relatively large
residuals, and include the two presumed outlying cases. Case 2, however, is
given full weight in this robust fit, and it should be no surprise that the
estimated change in rainfall closely approximates (7.4) rather than (7.3).
When case 2 is deleted, the result then more closely resembles (7.3). Thus,
robust methods that are relatively insensitive to outliers may still depend on
cases that are influential because of high leverage.
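The weights mentioned above can be illustrated with the usual Huber weight function. The sketch below is not the authors' implementation: it computes one set of weights from given residuals and does not reproduce the iteration, Bickel's stepping method or the starting value. The function name `huber_weights` and the residuals are hypothetical.

```python
from statistics import median


def huber_weights(residuals, truncation=1.0):
    """Weights implied by Huber's psi when the fit is viewed as weighted least squares.

    Residuals are scaled by median(|r_i|) / 0.6745; a scaled residual u gets
    weight 1 if |u| <= truncation and truncation / |u| otherwise.
    """
    scale = median(abs(r) for r in residuals) / 0.6745
    weights = []
    for r in residuals:
        u = abs(r) / scale
        weights.append(1.0 if u <= truncation else truncation / u)
    return weights


# Hypothetical residuals, not taken from the paper.
example = [0.10, -0.20, 0.30, -1.00, 0.05]
w = huber_weights(example, truncation=1.0)
```

The weights depend only on the size of the residuals. A high-leverage case such as case 2 can pull the fit toward itself, end up with a small residual and so keep full weight, which is the point made in the text.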
8. DISCUSSION AND SUMMARY
Cases are termed influential if important features of the analysis are altered
substantially when they are deleted. Individual or groups of cases can be
influential and yet go undetected during the usual analysis of residuals.
Thus, it is necessary to consider additional methodologies for isolating such
cases.
The empirical influence functions for $\hat{\beta}$ and $\hat{Y}$ as defined
here measure the influence that specified cases have on an analysis. However,
because these functions are multidimensional, it will usually be necessary to
consider lower dimensional characterizations. The characterizations and volume
ratios presented here can effectively isolate influential cases and are aids
to understanding the underlying causes. In particular, $D_I$ is a useful
omnibus measure of influence.
Once a subset of influential cases has been isolated, it is desirable to
investigate the causes by considering the components of $D_I$, $h_l^2$ and
$\lambda_l$ (cf. (3.8)). A useful summary of these components is provided by
$\sum h_l^2$ and $\sum \lambda_l/(1 - \lambda_l)$, although other effective
summaries are possible. In addition to considering the components of $D_I$, it
may be useful to investigate further the effects of the influential cases by
considering components of $\beta$ using $D_I(\psi)$.
Cases can be influential because they correspond to outlying responses, remote
points in the factor space or, perhaps, a combination of the two. The judgment
that a case is influential does not necessarily imply that it should be
deleted or down weighted, although this may be an attractive option if the
corresponding Studentized residual is large.
If a case is influential because it is remote in the factor space, then it
could be the most important case in the data since it may provide the only
information in a region where the ability to take observations is limited.
Alternatively, such a case might be deleted if it is believed that the model
fit to the bulk of the data is not appropriate in a neighborhood of the case
in question. Generally, decisions regarding such cases may be difficult since
there will be relatively little internal evidence for assessing their
validity. The decision to retain them may necessarily be based on faith alone
if external evidence of their authenticity is lacking. For an extreme example,
in the FACE data, it is not possible using (3.9) to test (3, 20) as an
outlying pair since removing them results in a rank deficient model.
9. ACKNOWLEDGMENTS
This work was supported by grant 1-R01-GM25587 from the National Institute of
General Medical Sciences, NIH. We are grateful to G. W. Stewart for several
enlightening discussions concerning computational problems relevant to this
paper, and to the referees for many helpful comments.
REFERENCES
ANDREWS, D. F. (1974). A robust method for multiple linear regression. Technometrics, 16, 523-31.
ANDREWS, D. F. and PREGIBON, D. (1978). Finding outliers that matter. J. Roy. Statist. Soc. B, 40, 85-93.
BELSLEY, D. A., KUH, E. and WELSCH, R. E. (1980). Regression Diagnostics. New York: Wiley.
BINGHAM, C. (1977). Some identities useful in the analysis of residuals from linear regression. Technical Report No. 300, School of Statistics, University of Minnesota, St. Paul, MN 55108.
BICKEL, P. (1975). One step Huber estimates in the linear model. J. Amer. Statist. Assoc., 70, 428-434.
BRADLEY, R. A., SRIVASTAVA, S. S. and LANZDORF, A. (1979). Some approaches to statistical analysis of a weather modification experiment. Technical Report M490, Department of Statistics, Florida State University, Tallahassee, FL 32306.
COOK, R. D. (1977). Detection of influential observations in linear regression. Technometrics, 19, 15-18.
COOK, R. D. (1979). Influential observations in linear regression. J. Amer. Statist. Assoc., 74, 169-174.
COOK, R. D. and HOLSCHUH, N. (1979). Comment on field experimentation in weather modification. J. Amer. Statist. Assoc., 74, 68-70.
DONGARRA, J., BUNCH, J. R., MOLER, C. B. and STEWART, G. W. (1979). The LINPACK User's Guide. Philadelphia: SIAM.
DRAPER, N. and JOHN, J. A. (1979). Influential observations and outliers in regression. Technical Report 581, Department of Statistics, University of Wisconsin; to appear in Technometrics.
FURNIVAL, G. and WILSON, R. (1974). Regression by leaps and bounds. Technometrics, 16, 499-511.
GENTLEMAN, J. F. and WILK, M. B. (1975). Detecting outliers II: Supplementing the direct analysis of residuals. Biometrics, 31, 387-410.
HOAGLIN, D. C. and WELSCH, R. (1978). The hat matrix in regression and ANOVA. American Statistician, 32, 17-22.
HUBER, P. (1973). Robust regression: Asymptotics, conjectures and Monte Carlo. Ann. Statist., 1, 799-821.
HUBER, P. (1975). Robustness and designs. In A Survey of Statistical Design and Linear Models, ed. by J. N. Srivastava, pp. 287-302. Amsterdam: North Holland.
JAECKEL, L. A. (1972). The infinitesimal jackknife. Bell Laboratories Memorandum; Murray Hill, N.J. 07974.
MALLOWS, C. L. (1975). Some topics in robustness. Bell Laboratories Memorandum; Murray Hill, N.J. 07974.
STEWART, G. W. (1974). Introduction to Matrix Computation. New York: Academic Press.
SRIKANTAN, K. S. (1961). Testing for a single outlier in a regression model. Sankhya A, 23, 251-260.
WEISBERG, S. (1980). Applied Linear Regression. New York: Wiley.
WELSCH, R. E. and KUH, E. (1977). Linear regression diagnostics. Working paper No. 173, National Bureau of Economic Research, Cambridge, MA.
WOODLEY, W. L., SIMPSON, J., BIONDINO, R. and BERKELEY, J. (1977). Rainfall results 1970-75: Florida area cumulus experiment. Science, 195, (February 25), 735-742.