Characterizations of an Empirical Influence Function for Detecting Influential Cases in Regression

R. Dennis Cook and Sanford Weisberg
School of Statistics, University of Minnesota, St. Paul, MN 55108

Technometrics, Vol. 22, No. 4 (Nov., 1980), pp. 495-508. Published by the American Statistical Association and the American Society for Quality. Stable URL: http://www.jstor.org/stable/1268187

The following paper was presented at the 24th Annual Fall Technical Conference of the Chemical Division of the ASQC and the Section on Physical and Engineering Sciences of the ASA in Cincinnati, Ohio, October 23-24, 1980.

Received August 1979; revised April 1980.

Traditionally, most of the effort in fitting full rank linear regression models has centered on the study of the presence, strength, and form of relationships between the measured variables. As is now well known, least squares regression computations can be strongly influenced by a few cases, and a fitted model may more accurately reflect unusual features of those cases than the overall relationships between the variables. It is of interest, therefore, for an analyst to be able to find influential cases and, based on them, make decisions concerning their usefulness in a problem at hand.

Based on an empirical influence function, we discuss methodologies for assessing the influence of individual or groups of cases on a regression problem. We conclude with an example using data from the Florida Area Cumulus Experiments (FACE) on cloud seeding.

KEY WORDS: Linear models; Distance measures; Robustness; Residual plotting; Outlier tests; Cloud seeding

1. INTRODUCTION

The problems we consider arise in the context of the linear model

$$Y = X\beta + e \tag{1.1}$$

where $Y$ is an $n \times 1$ vector of responses, $X$ is an $n \times p$ matrix of known constants, $\beta$ is a $p \times 1$ parameter vector, and $e$ is an $n \times 1$ vector of errors. Data analyses based on this model usually center on the presence, form and strength of relationship between the response and independent variables (columns of $X$). Estimation, hypothesis testing, model selection and prediction are typical concerns.

Recently, interest in the role that each observation or case can play in an analysis has increased (here, a case refers to a response $y_i$ along with the associated design point, or row of $X$). A case may be judged influential if important features of the analysis are altered substantially when it is deleted from the data. For example, deletion of one or more cases can result in a substantial change in the center, orientation or volume of a confidence ellipsoid for unknown parameters. Individual or groups of cases can exert a substantial influence on the analysis and yet go undetected when the residuals are examined.
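The last point can be illustrated numerically. The following is a minimal sketch, assuming NumPy and a small synthetic data set invented for illustration (it is not from the paper): one case lies far from the others in $X$ and pulls the fitted line toward itself, so its residual is not the largest in the data even though deleting it changes the slope substantially.

```python
import numpy as np

def ols(X, y):
    """Least squares coefficients for y = X b + e (X of full column rank)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Hypothetical data: ten cases near the line y = 1 + 0.5 x, plus one remote case.
x = np.array([1.0, 2, 3, 4, 5, 6, 7, 8, 9, 10, 30])
noise = np.array([0.3, -0.2, 0.1, -0.4, 0.2, -0.1, 0.4, -0.3, 0.1, -0.1])
y = np.append(1 + 0.5 * x[:10] + noise, 8.0)  # remote case lies well below the trend
X = np.column_stack([np.ones_like(x), x])

beta_full = ols(X, y)
resid = y - X @ beta_full

# Refit with the remote case (index 10) deleted.
keep = np.arange(len(y)) != 10
beta_del = ols(X[keep], y[keep])

print("full-data slope:           ", beta_full[1])
print("slope without remote case: ", beta_del[1])
print("residual of remote case:   ", resid[10])
print("largest absolute residual: ", np.abs(resid).max())
```

In this sketch the remote case would not stand out in a residual plot, which is the motivation for the deletion-based measures developed in the following sections.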
An important class of measures of influence can be based on empirical influence functions. Let $\hat\beta$ be the usual least squares estimator of $\beta$ based on the full data and let $\hat\beta_A$ be an alternative least squares estimator based on a subset of the data. Then, the empirical influence function for $\hat\beta$, $IF_A$, is defined to be

$$IF_A = \hat\beta_A - \hat\beta. \tag{1.2}$$

$IF_A$ is one of several empirical influence functions described by Mallows (1975); see also Jaeckel (1972). The values of $IF_A$ are comparable only within the given data set and model. Also, since $IF_A$ is a $p$-vector, routine use of it as a diagnostic for isolating influential cases may be labored. Alternatively, for a given positive (semi)definite matrix $M$ and a nonzero scale factor $c$, $IF_A$ can be characterized by the distance, $D_A(M, c)$, between $\hat\beta$ and $\hat\beta_A$ defined by

$$D_A(M, c) = \frac{(IF_A)^T M (IF_A)}{c}. \tag{1.3}$$

The matrix $M$ can be chosen to reflect specific interests.

In some applications, measurement of the influence of cases on the fitted values, $\hat Y = X\hat\beta$, may be more appropriate than measuring influence on $\hat\beta$. For example, if prediction is the primary goal it may be convenient to work with a reparameterized model where the regression coefficients are not of interest. The empirical influence function for $\hat Y$ is defined as $X(IF_A)$. Again, since $X(IF_A)$ is an $n$-vector, it is convenient to consider one-dimensional characterizations. Clearly, the distance measure $D_A(M, c)$ may be regarded as a norm of $X(IF_A)$ provided $M$ is chosen to be of the form $M = X^T B X$, where $B$ can be chosen to reflect specific interests.

In this paper, measures of the form (1.3) are studied for alternative estimators obtained by deleting a single case (Section 2), or by deleting several cases (Section 3), and also for problems where linear functions of $\beta$ are of interest (Section 4). Next we turn to measures based on volumes of confidence ellipsoids rather than on distances (centers); these are not directly interpretable in terms of the empirical influence functions. After a brief discussion of computations in Section 6, the methods of this paper are applied to an example using data from the Florida Area Cumulus Experiment (Woodley et al., 1977) on cloud seeding.

2. DELETING ONE CASE AT A TIME: DISTANCE MEASURES

Assuming the model (1.1) with rank$(X) = p$ and Cov$(e) = \sigma^2 I$, the least squares estimator of $\beta$ using the full data is $\hat\beta = (X^TX)^{-1}X^TY$ and the full sample estimate of $\sigma^2$ is $s^2 = Y^T(I - X(X^TX)^{-1}X^T)Y/(n - p)$. Let $r_i$ denote the $i$th residual, $r_i = y_i - x_i^T\hat\beta$, where $x_i^T$ is the $i$th row of $X$. Loosely speaking, cases with large $r_i$ have been considered as ones for which the model fails either due to incorrect functional form or because of an outlier in $Y$.

We shall need additional notation. A subscript "(i)" added to a quantity means "with the ith case deleted". Thus, for example, $X_{(i)}$ is an $(n - 1) \times p$ matrix derived from $X$ by deleting the $i$th row $x_i^T$, $\hat\beta_{(i)} = (X_{(i)}^TX_{(i)})^{-1}X_{(i)}^TY_{(i)}$, and so on. Also of importance is the projection matrix $V = (v_{ij}) = X(X^TX)^{-1}X^T$, an $n \times n$ rank $p$ matrix that projects onto the column space of $X$. The diagonal entries $v_{ii}$ are of special interest.

The empirical influence function (1.2) for $\hat\beta$ is given by

$$IF_i = \hat\beta_{(i)} - \hat\beta \tag{2.1}$$

and the distance measure (1.3) is given, for some $M$ and $c$, by

$$D_i(M, c) = \frac{(\hat\beta_{(i)} - \hat\beta)^T M (\hat\beta_{(i)} - \hat\beta)}{c}. \tag{2.2}$$

"Large" values of $D_i(M, c)$ would correspond to cases that, when deleted, result in large movement in the estimate of $\beta$. We shall call a case with a large value of $D_i(M, c)$ influential for estimating $\beta$ relative to $(M, c)$. Natural choices for $M$ and $c$ are $X^TX$ and $ps^2$ respectively, although others are possible. The resulting statistic, suggested by Cook (1977), is

$$D_i(X^TX, ps^2) = \frac{(\hat\beta_{(i)} - \hat\beta)^T (X^TX)(\hat\beta_{(i)} - \hat\beta)}{ps^2}. \tag{2.3}$$

Form (2.3) has a useful geometric interpretation: The magnitude of the distance between $\hat\beta$ and $\hat\beta_{(i)}$ may be assessed by comparing $D_i(X^TX, ps^2)$ to the probability points of the central $F$ with $p$ and $n - p$ degrees of freedom. This is equivalent to studying the least squares confidence ellipsoids for $\beta$ based on the full data, and finding the ellipsoid that passes through $\hat\beta_{(i)}$; the $F$ distribution is used only to transform $D_i(X^TX, ps^2)$ to a more familiar scale. Also, (2.3) may be rewritten in the form (Bingham, 1977)

$$D_i(X^TX, ps^2) = \frac{(\hat Y_{(i)} - \hat Y)^T(\hat Y_{(i)} - \hat Y)}{ps^2} \tag{2.4}$$

suggesting that $D_i(X^TX, ps^2)$ is, aside from the scale factor $ps^2$, the ordinary squared Euclidean distance that the fitted vector moves when the $i$th case is deleted from the data.

A computationally convenient and revealing form for $D_i(X^TX, ps^2)$ is (Cook, 1977)

$$D_i(X^TX, ps^2) = \frac{1}{p}\, t_i^2\, \frac{v_{ii}}{1 - v_{ii}} \tag{2.5}$$

where $t_i = r_i/(s(1 - v_{ii})^{1/2})$ is the $i$th Studentized residual. Thus, this distance measure is the product of a random term, $t_i^2$, and a fixed term, $v_{ii}/(1 - v_{ii})$. Assuming that the correct model is given by (1.1) and under normality, $F_i = t_i^2(n - p - 1)/(n - p - t_i^2)$ follows an $F$-distribution with 1 and $n - p - 1$ degrees of freedom. The importance and usefulness of the $t_i^2$'s and $v_{ii}$'s in an analysis have been studied by Srikantan (1961), Huber (1975), Cook (1977, 1979), Welsch and Kuh (1977), Hoaglin and Welsch (1978), Belsley, Kuh and Welsch (1980), and Weisberg (1980).

Clearly, $D_i(X^TX, ps^2)$ can be large if either $t_i^2$ or $v_{ii}$ is large. A case with a large value of $v_{ii}$ is called a high leverage case while a case with a large value of $t_i^2$ is called a (potential) outlier.

Characteristics of $x_i^T$ which cause $v_{ii}$ to be large can be seen as follows: Assuming that the intercept is in the model, let $\mu_1 \le \mu_2 \le \cdots \le \mu_{p-1}$ denote the eigenvalues of the corrected cross product matrix for the data, and $p_1, \ldots, p_{p-1}$ denote the corresponding eigenvectors. Then, by the spectral decomposition of the corrected cross product matrix,

$$v_{ii} = \frac{1}{n} + \sum_{l=1}^{p-1} \frac{(p_l^T(x_i - \bar x))^2}{\mu_l} \tag{2.6}$$

where $\bar x$ is the vector of sample averages. Further, letting $\theta_{li}$ denote the angle between $p_l$ and $(x_i - \bar x)$ we obtain

$$\cos(\theta_{li}) = \frac{p_l^T(x_i - \bar x)}{((x_i - \bar x)^T(x_i - \bar x))^{1/2}} \tag{2.7}$$

and

$$v_{ii} = \frac{1}{n} + (x_i - \bar x)^T(x_i - \bar x) \sum_{l=1}^{p-1} \frac{\cos^2(\theta_{li})}{\mu_l}. \tag{2.8}$$

Thus, $v_{ii}$ is large if (1) $x_i$ is far from $\bar x$, that is, it is well removed from the bulk of the cases, and (2) $x_i$ is substantially in a direction of an eigenvector corresponding to a small eigenvalue of the corrected cross product matrix. On the other hand, if $(x_i - \bar x)$ is small, $v_{ii}$ will be small regardless of its direction. Contours of constant $v_{ii}$ are ellipsoids centered at $\bar x$ with axes given by the eigenstructure of $(X^TX)^{-1}$.

The $v_{ii}$ also play an important role in the character of the usual test for outliers in linear regression. The usual likelihood ratio test, under a mean-shift model, of the hypothesis that the $i$th case is not an outlier, is a monotonic function of the $i$th Studentized residual and the power of this test is a decreasing function of $v_{ii}$ (Gentleman and Wilk, 1975; Cook, 1979). Thus the ability to detect outliers at unusual design points (cases with large $v_{ii}$) can be much less than the ability of detecting outliers for cases with smaller values of $v_{ii}$. Yet, it is precisely the points with the lowest potential power for the outlier test that are usually of the greatest interest. However, if the Bonferroni inequality is applied to the $F$ distribution to get significance levels for the outlier test, probability of error can be apportioned to cases unequally, giving smaller critical values to cases with larger values of $v_{ii}$. One rule of this type would choose the critical value for the $i$th case to correspond to the $\alpha_i = \alpha v_{ii}/p$ probability point of $F$ rather than the usual $\alpha_i = \alpha/n$ point used for all $i$. This unequal allocation of probability does not affect the underlying distribution of the test statistic as the $v_{ii}$ are assumed to be fixed, but in large samples, it may result in a substantial increase in power for the outlier test for cases with large $v_{ii}$.

Alternative Choices for M, c.

Table 1 includes several alternative choices for $M$ and $c$. Although no detailed comparison of these alternatives has been made, it is likely that any choice for $M$, $c$ with $D_i(M, c)$ location/scale invariant would give approximately the same information. However, the geometric interpretation of the distance measure will vary with $M$, $c$. For example, the measure with $M = X_{(i)}^TX_{(i)}$ and $c = ps_{(i)}^2$ refers to the distance from $\hat\beta_{(i)}$ to $\hat\beta$ relative to the confidence ellipsoids computed without the $i$th case, and computed distances are not directly comparable as they refer to different metrics.

In the sequel, only the distance measure $D_i(X^TX, ps^2)$ will be discussed since the other reasonable choices for $(M, c)$ lead to essentially analogous statistics. For simplicity of notation, we shall write $D_i$ for $D_i(X^TX, ps^2)$.

TABLE 1. $D_i(M, c) = (\hat\beta_{(i)} - \hat\beta)^T M (\hat\beta_{(i)} - \hat\beta)/c$, for various $M$, $c$.

M | c | Reduced form | Comments
$X^TX$ | $ps^2$ | $\frac{1}{p}\, t_i^2\, \frac{v_{ii}}{1 - v_{ii}}$ | Cook (1977)
$X^TX$ | $ps_{(i)}^2$ | $\frac{1}{p}\, F_i\, \frac{v_{ii}}{1 - v_{ii}}$ | $(DFFITS)^2$, Welsch and Kuh (1977)
$X_{(i)}^TX_{(i)}$ | $ps^2$ | $\frac{1}{p}\, t_i^2\, v_{ii}$ |
$X_{(i)}^TX_{(i)}$ | $ps_{(i)}^2$ | $\frac{1}{p}\, F_i\, v_{ii}$ |
$[\mathrm{diag}((X^TX)^{-1})]^{-1}$ | $ps_{(i)}^2$ | $\frac{1}{p}\, \frac{F_i}{1 - v_{ii}}\, x_i^T(X^TX)^{-1}M(X^TX)^{-1}x_i$ | $(NDFBETAS)^2$, Welsch and Kuh (1977)

3. GENERAL DISTANCE MEASURES FOR SEVERAL CASES

The one at a time statistics can be expected to provide the majority of the information needed to carry out the analysis. However, in some data sets subsets of cases can be jointly influential, but individually are uninfluential. Consider, for example, Figure 1.
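The distinction between one-at-a-time and joint influence can also be sketched numerically. The following is a minimal sketch, assuming NumPy and hypothetical data invented for illustration (two nearly coincident remote cases, not the points of Figure 1); the function names are assumptions. Single-case distances use form (2.5), and the distance for the pair is found by refitting without both cases, with $M = X^TX$ and $c = ps^2$.

```python
import numpy as np

def fit(X, y):
    """Least squares coefficients, residuals and residual mean square s^2."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    r = y - X @ beta
    n, p = X.shape
    return beta, r, r @ r / (n - p)

def single_case_distances(X, y):
    """D_i for every case from form (2.5): t_i^2 v_ii / (p (1 - v_ii))."""
    n, p = X.shape
    _, r, s2 = fit(X, y)
    # Diagonal of V = X (X^T X)^{-1} X^T without forming the full n x n matrix.
    v = np.einsum("ij,ji->i", X, np.linalg.solve(X.T @ X, X.T))
    t2 = r**2 / (s2 * (1 - v))
    return t2 * v / (p * (1 - v))

def distance_by_deletion(X, y, idx):
    """Distance with M = X^T X, c = p s^2, found by refitting without idx."""
    n, p = X.shape
    beta, _, s2 = fit(X, y)
    keep = np.setdiff1d(np.arange(n), idx)
    beta_del, _, _ = fit(X[keep], y[keep])
    d = beta_del - beta
    return d @ (X.T @ X) @ d / (p * s2)

# Hypothetical data: eight cases near the line y = x, two remote cases (last two rows).
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 20, 20], dtype=float)
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1, 6.0, 6.8, 8.1, 8.0, 8.4])
X = np.column_stack([np.ones_like(x), x])

D_single = single_case_distances(X, y)
D_pair = distance_by_deletion(X, y, [8, 9])
print("largest one-at-a-time distance:", D_single.max())
print("distance for the remote pair:  ", D_pair)
```

With these invented data every single-case distance is small while the distance for the remote pair is large, because either remote case alone is enough to hold the fitted line in place.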
If point C or point D is deleted, the fitted model will change very little. If both are deleted, however, estimates of parameters may show large changes. Conversely, if A or B is deleted, the fitted line will change; if A and B are both deleted, the line will stay about the same.

[FIGURE 1. A and B are individually influential, but not jointly. C and D are jointly influential but not individually.]

Let $I$ be an $m$-vector of indices that specify the $m$ cases to be deleted. The subscript "(I)" will mean "with the m cases indexed by I deleted", while "I" without parentheses will mean with only the cases indexed by $I$ remaining. For example, $V_I$ is an $m \times m$ submatrix of $V$ formed by the intersection of the rows and columns indexed by $I$, and $r_I$ is the $m \times 1$ vector of residuals for cases indexed in $I$. The empirical influence function is

$$IF_I = \hat\beta_{(I)} - \hat\beta \tag{3.1}$$

and the distance function $D_I(X^TX, ps^2) = D_I$ is

$$D_I = \frac{(\hat\beta_{(I)} - \hat\beta)^T (X^TX)(\hat\beta_{(I)} - \hat\beta)}{ps^2}. \tag{3.2}$$

The geometric interpretation of $D_I$ is identical to that of $D_i$. An influential subset for estimating $\beta$ will correspond to a large $D_I$.

A convenient formula for $D_I$ can be derived from the result (Bingham, 1977)

$$\hat\beta_{(I)} - \hat\beta = -(X^TX)^{-1}X_I^T(I - V_I)^{-1}r_I$$

so that

$$D_I = \frac{r_I^T(I - V_I)^{-1}V_I(I - V_I)^{-1}r_I}{ps^2}. \tag{3.3}$$

Insight into $D_I$ can be obtained by applying the spectral decomposition to $V_I$. There is an $m \times m$ diagonal matrix $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_m)$ with $0 \le \lambda_1 \le \cdots \le \lambda_m \le 1$ and an orthogonal matrix $\Gamma$, such that

$$V_I = \Gamma^T \Lambda \Gamma. \tag{3.4}$$

For convenience, we will suppress all indication that $\Gamma$ and $\Lambda$ depend on $I$. If $\lambda_m = 1$, then inverses in (3.3) do not exist. If the cases indexed by $I$ are removed, the resulting data are rank deficient, and a unique $\hat\beta_{(I)}$ does not exist. Therefore, if $\lambda_m = 1$, we set $D_I = \infty$. If $\lambda_m < 1$, (3.3) can be written as

$$D_I = \frac{r_I^T(\Gamma^T\Gamma - \Gamma^T\Lambda\Gamma)^{-1}\,\Gamma^T\Lambda\Gamma\,(\Gamma^T\Gamma - \Gamma^T\Lambda\Gamma)^{-1}r_I}{ps^2} = \frac{(\Gamma r_I)^T(I - \Lambda)^{-1}\Lambda(I - \Lambda)^{-1}(\Gamma r_I)}{ps^2}. \tag{3.5}$$

Letting $g = (g_l) = \Gamma r_I$, we obtain

$$D_I = \frac{g^T(I - \Lambda)^{-1}\Lambda(I - \Lambda)^{-1}g}{ps^2} = \sum_{l=1}^{m} \frac{g_l^2\,\lambda_l}{(1 - \lambda_l)^2\, ps^2}. \tag{3.6}$$

Each $g_l$ is a linear combination of the elements of $r_I$, and $\mathrm{Var}(g) = \mathrm{Var}(\Gamma r_I) = \sigma^2\,\Gamma\Gamma^T(I - \Lambda)\Gamma\Gamma^T = \sigma^2(I - \Lambda)$. Thus, the $g_l$ are uncorrelated with $\mathrm{Var}(g_l) = \sigma^2(1 - \lambda_l)$. Further, if we let

$$h_l^2 = \frac{g_l^2}{s^2(1 - \lambda_l)} \tag{3.7}$$

then (3.6) may be rewritten

$$D_I = \frac{1}{p} \sum_{l=1}^{m} h_l^2\, \frac{\lambda_l}{1 - \lambda_l}. \tag{3.8}$$

The $h_l$ are identically distributed.

The resemblance of (3.8) to (2.5) is striking. The role of the $t_i^2$ is assumed by the $h_l^2$ and $v_{ii}/(1 - v_{ii})$ is replaced by the $\lambda_l/(1 - \lambda_l)$. In (3.8), a sum over $m$ orthogonal directions is required; in (2.5), $m = 1$.

A generalization of the squared Studentized residuals to $m$ cases is given by $r_I^T(I - V_I)^{-1}r_I/s^2 = \sum h_l^2$. Under normal theory (Gentleman and Wilk, 1975), a likelihood ratio test statistic for the hypothesis that the $m$ cases are not an outlying set can be computed as

$$F_I = \frac{(n - p - m)\sum h_l^2}{m\,(n - p - \sum h_l^2)}. \tag{3.9}$$

The nominal distribution of $F_I$ is $F(m, n - p - m)$.

When $m = 1$, the choice of $v_{ii}$ to represent leverage is reasonable, since from Table 1, most choices of $M$, $c$ suggest a monotonic transformation of $v_{ii}$ as the fixed part of the distance measure. When $m > 1$, leverage is more difficult to define. One definition of leverage is obtained by replacing $s^2$ with $\sigma^2$ and taking expectations in (3.6) to get $\sum \lambda_l/(1 - \lambda_l) = \mathrm{tr}(V_I(I - V_I)^{-1})$. This definition, however, does depend on $M$ and $c$. For the measure $D_I(X_{(I)}^TX_{(I)}, ps^2)$, the analogous definition of leverage is $\sum \lambda_l = \mathrm{tr}(V_I)$. These two statistics are not monotonic functions of each other. While $\mathrm{tr}(V_I(I - V_I)^{-1})$ is attractive because it corresponds to our preferred metric, $\mathrm{tr}(V_I)$ is also appealing because it is easily computed from the $v_{ii}$, and exact formation of $V_I$ is not required. Based on other considerations, Draper and John (1979) suggest using $|I - V_I| = \prod(1 - \lambda_l)$ as a measure of leverage, with small values indicating potential leverage. This, too, is not a monotonic function of the other suggested leverage values, and choice for the form of a leverage statistic may depend on the purpose for which it is to be used.

Looking at the $D_I$.

One goal in examining subsets of $m > 1$ cases is to find groups of cases that, while not individually influential, are influential when taken as a group. Finding influential subsets which include cases that are individually influential may add little information because the observed influence of the subset will be due, in part, to the influence of the single influential case. Conversely, finding an uninfluential subset that includes one or more cases that are singly influential would not decrease the interest in those cases. Thus, good candidates for inclusion in subsets will have small distance values for $m = 1$, but they may well have relatively large values of $v_{ii}$ or $t_i^2$.

Alternatively, it may be desirable to consider the possibility that the individual cases in an influential subset are related (e.g. by time, order, etc.). In this situation, good candidates for inclusion in subsets will include individually influential cases.

Finally, when the influence of subsets is judged on internal scaling, lower order subsets should perhaps be ignored, as it is the relative rather than the absolute behavior of $D_I$ that is important. For additional comments on the use of internal variability, see Belsley, Kuh and Welsch (1980, p. 29).

If sufficient computer memory is available to store the residual vector and all of the elements of $V$, then an efficient algorithm based in part on the Furnival and Wilson (1974) method for subset selection can be written to find subsets of cases with large values of the outlier test statistic (3.9). However, an equivalent algorithm for finding subsets with large $D_I$ is not immediately apparent, since altering a subset by addition, deletion, or substitution of a case can result in substantial change in the eigenstructure of $V_I$, and hence in the value of $D_I$. Even so, complete storage of $V$ is usually impractical (but see Section 6), and realistic techniques for finding influential subsets should use only the residuals and the diagonal entries of $V$. Using only these, upper bounds for $D_I$ can be derived, and only if these are sufficiently large must $D_I$ be computed exactly.

For the first upper bound, since $\lambda_m/(1 - \lambda_m)^2 \ge \lambda_l/(1 - \lambda_l)^2$, $l = 1, 2, \ldots, m$, (3.6) can be approximated by

$$D_I \le \frac{\lambda_m}{(1 - \lambda_m)^2} \sum_{l=1}^{m} \frac{g_l^2}{ps^2}. \tag{3.10}$$

But $\sum g_l^2 = r_I^Tr_I = \sum_{i \in I} r_i^2$, and hence

$$D_I \le \frac{\lambda_m}{(1 - \lambda_m)^2}\, \frac{\sum_{i \in I} r_i^2}{ps^2}. \tag{3.11}$$

For (3.11) to be useful, $\lambda_m$ must be replaced by an approximation that can be computed without need for forming $V_I$. The easiest approximation to use is $\lambda_m \le \mathrm{tr}(V_I)$, assuming $\mathrm{tr}(V_I) < 1$. Thus,

$$D_I \le \frac{\mathrm{tr}(V_I)}{(1 - \mathrm{tr}(V_I))^2}\, \frac{\sum_{i \in I} r_i^2}{ps^2}$$

or, equivalently,

$$D_I \le \frac{\sum_{i \in I} v_{ii}}{(1 - \sum_{i \in I} v_{ii})^2}\, \frac{\sum_{i \in I} r_i^2}{ps^2}. \tag{3.12}$$

Approximation (3.12) depends only on the one at a time statistics and provides a potentially different upper bound for each $I$. For any subset with $\mathrm{tr}(V_I) \ge 1$, a better approximation to $\lambda_m$ is required. This, in turn, requires finding $V_I$. If $m$ is small (2 or 3) exact computation of $D_I$ is probably as efficient as obtaining a better approximation for $\lambda_m$.

For a fixed $m$, let $T = \max_I(\sum_{i \in I} v_{ii})$ and $R^2 = \max_I \sum_{i \in I} r_i^2$, where $I$ varies over all subsets of size $m$ under consideration. Two upper bounds for the right side of (3.12) are then

$$D_I \le \frac{\mathrm{tr}(V_I)}{(1 - \mathrm{tr}(V_I))^2}\, \frac{R^2}{ps^2} \tag{3.13}$$
and, if $T < 1$,

$$D_I \le \frac{T}{(1 - T)^2}\, \frac{\sum_{i \in I} r_i^2}{ps^2}. \tag{3.14}$$

These last two may be combined to give

$$D_I \le \frac{T}{(1 - T)^2}\, \frac{R^2}{ps^2}. \tag{3.15}$$

Clearly, (3.12) $\le$ (3.13) $\le$ (3.15), and (3.12) $\le$ (3.14) $\le$ (3.15). If $m = 1$, all four approximations are exact.

An algorithm for finding all relevant subsets with fixed $m$ can be based on these approximations. First, cases that are influential in subsets of size smaller than $m$ may be eliminated. Then, the remaining $v_{ii}$ and $r_i^2$ are ordered, largest to smallest. The four inequalities can then be applied to subsets with $\mathrm{tr}(V_I) < 1$ in the order (3.15), then (3.14) or (3.13), and finally (3.12). Exact computation is required if (3.12) is larger than a selected cutoff point. By considering subsets according to the ordered lists of $v_{ii}$ and $r_i^2$, the subsets that are more likely to be influential are considered first, and, once one of the bounds is sufficiently small, no further subsets made up of cases lower in the lists need to be considered. Generally, this method will be useful in data sets with $n$ large relative to $p$, where $\mathrm{tr}(V_I)$ will usually be less than 1. In smaller data sets, relatively more subsets must be considered. For example, for $m = 2$ and $m = 3$, and for two data sets, one with $n = 21$, $p = 10$ (discussed in Section 7) and the other with $n = 125$, $p = 4$, the results of a simple algorithm are summarized in Table 2. The number of subsets is less than the total number of possible subsets because cases influential in subsets of a smaller size were not considered as $m$ was increased. While in data set 1 little is gained by use of the inequalities, in data set 2, with large $n$, substantial decrease in computation is apparent. These results could be improved by refinement of the algorithm.

TABLE 2. Computations using upper bounds.

                                           Data Set 1        Data Set 2
                                           n = 21, p = 8     n = 125, p = 4
m = 2:
  Number of subsets considered             155               7503
  Number of applications of (3.12)-(3.14)  153               651
  Number of D_I computed                   74                5
m = 3:
  Number of subsets considered             560               302,621
  Number of applications of (3.12)-(3.14)  560               74,802
  Number of D_I computed                   520               727

4. LINEAR COMBINATIONS

In this section, we extend the previous discussion to accommodate the situation in which $q$ linearly independent combinations of the elements of $\beta$ are of interest. This may be desirable when interest centers on a selected subset of $\beta$. Also, once influential cases have been found using $D_I$, it may be desirable to isolate their effects on the components of $\beta$.

Let $\psi = L\beta$, where $L$ is a $q \times p$ rank $q$ matrix. The distance $D_I(\psi)$, between $\hat\psi = L\hat\beta$ and $\hat\psi_{(I)} = L\hat\beta_{(I)}$, is defined to be

$$D_I(\psi) = \frac{(\hat\psi_{(I)} - \hat\psi)^T [L(X^TX)^{-1}L^T]^{-1} (\hat\psi_{(I)} - \hat\psi)}{qs^2}. \tag{4.1}$$

This is a special case of the distance function $D_I(M, c)$ obtained by choosing $c = qs^2$ and

$$M = L^T[L(X^TX)^{-1}L^T]^{-1}L. \tag{4.2}$$

Bingham (1977) has shown that the numerator of $D_I(\psi)$ can be written as

$$qs^2 D_I(\psi) = r_I^T(I - V_I)^{-1}X_I(X^TX)^{-1}M(X^TX)^{-1}X_I^T(I - V_I)^{-1}r_I \tag{4.3}$$

where $M$ is given by (4.2). Apparently, further simplification is not possible without additional constraints on $L$. However, direct computation of $D_I(\psi)$ will probably be unnecessary for most $I$: Since $(X^TX)^{-1/2}M(X^TX)^{-1/2}$ is idempotent, it follows from (4.3) that $qs^2 D_I(\psi) \le ps^2 D_I$ and, therefore,

$$D_I(\psi) \le \frac{p}{q}\, D_I \tag{4.4}$$

for all $I$ and $\psi$. Thus, if $D_I$ is negligible, $D_I(\psi)$ must be negligible also. The result (4.4) was mentioned by Cook (1979) for the case $m = 1$.

Subsets of $\beta$

Suppose, without loss of generality, that the last $q$ components of $\beta$ are of interest, and partition $X = (X_1, X_2)$ where $X_1$ is $n \times (p - q)$ and $X_2$ is $n \times q$. Thus $L = (0, I_q)$, and

$$(X^TX)^{-1}M(X^TX)^{-1} = (X^TX)^{-1} - \begin{pmatrix} (X_1^TX_1)^{-1} & 0 \\ 0 & 0 \end{pmatrix} \tag{4.5}$$

where $M$ is given by (4.2). Substitution in (4.3) yields

$$qs^2 D_I(\psi) = ps^2 D_I - r_I^T(I - V_I)^{-1}U_I(I - V_I)^{-1}r_I \tag{4.6}$$

where $U = (u_{ij}) = X_1(X_1^TX_1)^{-1}X_1^T$. Alternatively, (4.6) can be written as

$$qs^2 D_I(\psi) = r_I^T(I - V_I)^{-1}(V_I - U_I)(I - V_I)^{-1}r_I. \tag{4.7}$$

Thus, when a single case is deleted ($m = 1$)

$$D_i(\psi) = \frac{t_i^2}{q}\, \frac{v_{ii} - u_{ii}}{1 - v_{ii}}. \tag{4.8}$$

The influence of a single case on a selected subset of $\beta$ may therefore be determined from the result of two separate regressions on the full data.

5. CHANGE IN VOLUME OF CONFIDENCE ELLIPSOIDS

Thus far, the measure of influence of the cases indexed by $I$ on the regression has been the empirical influence function. Many other possibilities exist for measuring the impact of these cases on the regression. In this section, we consider measures based on changes in volume of a confidence ellipsoid. We derive a volume measure for $\beta$ when deleting a single case, and then give the more general result.

The $(1 - \alpha) \times 100\%$ normal theory confidence region for $\beta$, based on the full data, is given by all points $\beta^*$ in the ellipsoid

$$E = \{\beta^* : (\beta^* - \hat\beta)^T (X^TX)(\beta^* - \hat\beta) \le ps^2 F(1 - \alpha; p, n - p)\} \tag{5.1}$$

where $F(1 - \alpha; n_1, n_2)$ is the $(1 - \alpha) \times 100\%$ point of an $F$ distribution with $(n_1, n_2)$ degrees of freedom. This ellipsoid, after the $i$th case is deleted, is

$$E_{(i)} = \{\beta^* : (\beta^* - \hat\beta_{(i)})^T (X_{(i)}^TX_{(i)})(\beta^* - \hat\beta_{(i)}) \le ps_{(i)}^2 F(1 - \alpha; p, n - p - 1)\}. \tag{5.2}$$

As is well known, the volume of an ellipsoid is proportional to the inverse of the square root of the determinant of the appropriate cross product matrix, and hence the logarithm of the ratio of the volume of (5.1) to (5.2) is

$$\log\frac{\mathrm{Vol}(E)}{\mathrm{Vol}(E_{(i)})} = \log\left[\frac{|X_{(i)}^TX_{(i)}|^{1/2}}{|X^TX|^{1/2}} \left(\frac{s^2 F(1 - \alpha; p, n - p)}{s_{(i)}^2 F(1 - \alpha; p, n - p - 1)}\right)^{p/2}\right] = \frac{1}{2}\log(1 - v_{ii}) + \frac{p}{2}\log\left[\frac{n - p - 1}{n - p - t_i^2} \cdot \frac{F(1 - \alpha; p, n - p)}{F(1 - \alpha; p, n - p - 1)}\right]. \tag{5.3}$$

Apart from the ratio of $F$ values, (5.3) is equivalent to the statistic COVRATIO given in Belsley, Kuh and Welsch (1980). This latter statistic, however, is motivated as a ratio of determinants, not directly as a ratio of volumes of confidence ellipsoids.
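The determinant and variance parts of (5.3) can be checked numerically. The following is a minimal sketch, assuming NumPy and synthetic data invented for illustration; the function names are hypothetical. It omits the ratio of $F$ percentage points, so it evaluates (5.3) only apart from the ratio of $F$ values, and it compares the closed form in $v_{ii}$ and $t_i^2$ with a direct computation from the full and deleted-case fits.

```python
import numpy as np

def log_volume_ratio_closed_form(X, y):
    """(1/2) log(1 - v_ii) + (p/2) log((n-p-1)/(n-p-t_i^2)) for every case i."""
    n, p = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    r = y - X @ beta
    s2 = r @ r / (n - p)
    v = np.einsum("ij,ji->i", X, np.linalg.solve(X.T @ X, X.T))  # diagonal of V
    t2 = r**2 / (s2 * (1 - v))
    return 0.5 * np.log(1 - v) + 0.5 * p * np.log((n - p - 1) / (n - p - t2))

def log_volume_ratio_direct(X, y, i):
    """Same quantity from the determinants and s^2 of the full and deleted fits."""
    n, p = X.shape

    def summary(Xa, ya):
        b = np.linalg.lstsq(Xa, ya, rcond=None)[0]
        e = ya - Xa @ b
        # slogdet returns (sign, log|det|); only the log-determinant is needed.
        return np.linalg.slogdet(Xa.T @ Xa)[1], e @ e / (len(ya) - p)

    keep = np.arange(n) != i
    logdet_full, s2_full = summary(X, y)
    logdet_del, s2_del = summary(X[keep], y[keep])
    return 0.5 * (logdet_del - logdet_full) + 0.5 * p * np.log(s2_full / s2_del)

rng = np.random.default_rng(0)  # synthetic data, for illustration only
X = np.column_stack([np.ones(15), rng.normal(size=(15, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=15)

closed = log_volume_ratio_closed_form(X, y)
direct = np.array([log_volume_ratio_direct(X, y, i) for i in range(15)])
print(np.max(np.abs(closed - direct)))  # the two computations should agree closely
```

The closed form needs only one regression on the full data, whereas the direct version refits once per deleted case.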
If (5.3) is large and positive, then deletion of the $i$th case will result in a substantial decrease in volume. Deletion of an outlier, for example, may give this result. On the other hand, if (5.3) is large and negative, deleting the case will result in a substantial increase in volume. This occurs if $v_{ii}$ is near 1. For most cases, (5.3) should be near zero.

More generally, the log of the relative volumes for estimating $\psi$, a subset of $q$ components of the $p$ vector $\beta$, after deleting the $m$ cases indexed by $I$, is found to be

$$\log\frac{\mathrm{Vol}(E_\psi)}{\mathrm{Vol}(E_{\psi(I)})} = \frac{1}{2}\log\frac{|I - V_I|}{|I - U_I|} + \frac{q}{2}\log\left[\frac{n - p - m}{n - p - \sum h_l^2} \cdot \frac{F(1 - \alpha; q, n - p)}{F(1 - \alpha; q, n - p - m)}\right] \tag{5.4}$$

where $U$ is defined following (4.6), and $h_l$ is defined by (3.7). In particular, since the intercept is not generally of interest, the ratio, with $m = 1$ and $q = p - 1$ ($\psi = (\beta_1, \ldots, \beta_{p-1})^T$), is

$$\log\frac{\mathrm{Vol}(E_\psi)}{\mathrm{Vol}(E_{\psi(i)})} = \frac{1}{2}\log\frac{1 - v_{ii}}{1 - (1/n)} + \frac{p - 1}{2}\log\left[\frac{n - p - 1}{n - p - t_i^2} \cdot \frac{F(1 - \alpha; p - 1, n - p)}{F(1 - \alpha; p - 1, n - p - 1)}\right]. \tag{5.5}$$

This last form is recommended for general use.

An alternative volume measure was suggested by Andrews and Pregibon (1978). They define $X^* = (X\ Y)$, and suggest looking at the ratio

$$R_I(X^*) = \frac{|X^{*T}_{(I)}X^*_{(I)}|}{|X^{*T}X^*|} \tag{5.6}$$

which corresponds to the proportion in total volume generated by the data that is not due to the cases indexed by $I$. "Distant" or unusual sets of cases will tend to have $R_I(X^*)$ small while cases in the middle of the data will account for little volume. Unlike the earlier volume measures discussed in this section, $R_I(X^*)$ is invariant with respect to specification of the response variable from among the columns of $X^*$, and refers to a $p + 1$ dimensional geometry, while the earlier measures are generated with specific reference to $Y$ and to the $p$ dimensional geometry of the column space of $X$. Draper and John (1979) provide a comparison of $R_I(X^*)$ and $D_I$.

6. COMPUTATIONAL CONSIDERATIONS

Depending on available computer storage, either the Cholesky factorization of $X^TX$ or a QR factorization of $X$ can be used to obtain submatrices of $V$. With minimal storage available, we can find a $p \times p$ rank $p$ upper triangular matrix $R$ such that $X^TX = R^TR$. Then since $v_{ij} = x_i^T(X^TX)^{-1}x_j$, we have

$$v_{ij} = (R^{-T}x_i)^T(R^{-T}x_j). \tag{6.1}$$

Explicit inversion of $R$ can be avoided by solving the system $x_i = R^Tc_i$ for $c_i$ by back substitution. Then, $v_{ij}$ can be computed as an inner product, $v_{ij} = c_i^Tc_j$.

With more storage available, we can find in addition to $R$ an $n \times p$ matrix with orthogonal columns $Q_1$ such that $X = Q_1R$. It follows immediately that $V = Q_1Q_1^T$, so $v_{ij}$ can be found as the inner product of the $i$th and $j$th rows of $Q_1$. Thus, $Q_1$, requiring $np$ storage locations, contains all the information in $V$, which requires $n(n + 1)/2$ storage locations. In addition, computation of $u_{ii}$ in (4.8) is easily done if the columns of $X_1$ (as in Section 4) are the first $p - q$ columns of $X$. Then $u_{ii}$ is just the squared norm of the first $p - q$ columns of the $i$th row of $Q_1$.

Complete discussions of the Cholesky factorization (to obtain $R$) and the QR factorization (to obtain $R$ and $Q_1$) are given by Stewart (1974). In addition, the LINPACK library (Dongarra et al., 1979) includes well-documented subroutines for them.

All of the statistics studied in this paper are functions of the residuals, the $v_{ij}$ (and perhaps also $u_{ij}$) and usual regression summaries such as the residual mean square. For example, $D_I$ can be computed by first finding $a = (I - V_I)^{-1}r_I$. Then, $D_I$ is computed as a quadratic form, $D_I = a^TV_Ia/ps^2$.

7. THE 1975 FLORIDA AREA CUMULUS EXPERIMENT (FACE)

Judging the success of cloud seeding experiments intended to increase rainfall is an important statistical problem. Results from past experiments are mixed. It is generally recognized that, depending on various environmental contributing factors, seeding can produce an increase or decrease in rainfall, or have no effect. Moreover, the critical factors controlling the response are, for the most part, unknown. This fundamental treatment-unit nonadditivity makes judgments about the effects of seeding difficult (Cook and Holschuh, 1979).

The 1975 Florida Area Cumulus Experiment (FACE) was conducted to determine the merits of using silver iodide to increase rainfall and to isolate some of the factors contributing to the treatment-unit nonadditivity (Woodley et al., 1977; see also Bradley, Srivastava and Lanzdorf, 1979). The target consisted of an area of about 3,000 square miles to the north and east of Coral Gables, Florida. In this experiment, 24 days in the summer of 1975 were judged suitable for seeding based on a daily suitability criterion of $S - Ne > 1.5$, where $S$ (seedability) is the predicted difference between the maximum heights of a cloud if seeded and the same cloud if not seeded, and $Ne$ is a factor which increases with conditions leading to naturally rainy days. (For a more detailed description see Woodley et al., 1977.) Generally, suitable days were those on which the seedability was large and the natural rainfall early in the day was small. On each suitable day, the decision to seed was based on unrestricted randomization; as it happened, 12 days were seeded and 12 were unseeded.

The following variables were measured on each suitable day:

Echo Coverage (C): Percent cloud cover in the experimental area, measured using radar in Coral Gables, Florida

Prewetness (P): Total rainfall in the target area one hour before seeding (in cubic meters x 10^7)

Echo Motion (E): A classification indicating a moving radar echo (1) or a stationary radar echo (2)

Response Variable (Y): The amount of rain that fell in the target area for a six-hour period on each suitable day (in cubic meters x 10^7).

The data as presented by Woodley et al. (1977) are reproduced in Table 3. We have also included the variable

Time Trend (T): Number of days after the first day of the experiment (June 16, 1975 = 0).
This variable is potentially relevant because there may be a time trend in natural rainfall or modification in the experimental techniques.

In addition to selecting days based on suitability ($S - N_e$), the investigators attempted to use only days with $C < 13$ percent. A disturbed day was defined as $C > 13$. From Table 3, the first two experimental days are disturbed, with the second day being highly disturbed ($C = 37.9$ percent). Thus it can be anticipated that case 2 may be influential, and the process under study may differ under the conditions of case 2. Therefore, case 2 will be deleted from the primary analysis. The effects of including case 2 will be presented later.

TABLE 3-Data from Florida Area Cumulus Experiment, 1975 (source: Woodley et al., 1977).

Case  A   T  S-Ne     C      P  E    SA    CA     PA  EA      Y
   1  0   0  1.75  13.4   .274  2     0     0      0   0  12.85
   2  1   1  2.70  37.9  1.267  1  2.70  37.9  1.267   1   5.52
   3  1   3  4.10   3.9   .198  2  4.10   3.9   .198   2   6.29
   4  0   4  2.35   5.3   .526  1     0     0      0   0   6.11
   5  1   6  4.25   7.1   .250  1  4.25   7.1   .250   1   2.45
   6  0   9  1.60   6.9   .018  2     0     0      0   0   3.61
   7  0  18  1.30   4.6   .307  1     0     0      0   0    .47
   8  0  25  3.35   4.9   .194  1     0     0      0   0   4.56
   9  0  27  2.85  12.1   .751  1     0     0      0   0   6.35
  10  1  28  2.20   5.2   .084  1  2.20   5.2   .084   1   5.06
  11  1  29  4.40   4.1   .236  1  4.40   4.1   .236   1   2.76
  12  1  32  3.10   2.8   .214  1  3.10   2.8   .214   1   4.05
  13  0  33  3.95   6.8   .796  1     0     0      0   0   5.74
  14  1  35  2.90   3.0   .124  1  2.90   3.0   .124   1   4.84
  15  1  38  2.05   7.0   .144  1  2.05   7.0   .144   1  11.86
  16  0  39  4.00  11.3   .398  1     0     0      0   0   4.45
  17  0  53  3.35   4.2   .237  2     0     0      0   0   3.66
  18  1  55  3.70   3.3   .960  1  3.70   3.3   .960   1   4.22
  19  0  56  3.80   2.2   .230  1     0     0      0   0   1.16
  20  1  59  3.40   6.5   .142  2  3.40   6.5   .142   2   5.45
  21  1  65  3.15   3.1   .073  1  3.15   3.1   .073   1   2.02
  22  0  68  3.15   2.6   .136  1     0     0      0   0    .82
  23  1  82  4.01   8.3   .123  1  4.01   8.3   .123   1   1.09
  24  0  83  4.65   7.4   .168  1     0     0      0   0    .28

SA, CA, PA and EA are the products of A with S - Ne, C, P and E.

Initially, we shall adopt the model

$$LY = \beta_0 + \beta_1 A + \beta_2 T + \beta_3 (S - N_e) + \beta_4 C + \beta_5 LP + \beta_6 E + \beta_{13} (A \times (S - N_e)) + \beta_{14} (A \times C) + \beta_{15} (A \times LP) + \beta_{16} (A \times E) \qquad (7.1)$$

where $LY = \log_{10} Y$ and $LP = \log_{10} P$. This model contains all linear terms and all cross products between action ($A = 1$ for seeded days, $A = 0$ for unseeded days) and the base variables. The cross product terms are to model the possibility of treatment-unit nonadditivity. Because of the limited degrees of freedom available, higher order terms in the base variables have not been included.

The main goal of our analysis is to describe the difference, $\Delta LY$, between predicted rainfall for seeded days and unseeded days,

$$\Delta LY = LY(A = 1) - LY(A = 0) = \beta_1 + \beta_{13} (S - N_e) + \beta_{14} C + \beta_{15} LP + \beta_{16} E. \qquad (7.2)$$

Thus, the subset $\psi^T = (\beta_1, \beta_{13}, \beta_{14}, \beta_{15}, \beta_{16})$ is of primary interest.

Table 4 gives the least squares estimates and their estimated standard errors for the coefficients in (7.1), as well as the estimates for a few selected subset models to be discussed later.

Case Analysis: Full Model

Table 5 gives $r_i$, $t_i$, $v_{ii}$, $F_i^{1/2}$ (see 3.9), $D_i$ and $D_i(\psi)$ for the full model without case 2. The largest two values of each statistic except $v_{ii}$ listed in Table 5 correspond to cases 7 and 24, both unseeded days with unusually low rainfall recorded. The values of $F_i^{1/2}$ for $i = 7$ and 24 and the Studentized residual plot (Andrews and Pregibon, 1978) given in Figure 2 suggest that these cases do not conform to the assumed model. Using the Bonferroni inequality with equal allocation of probability or the allocation method of Section 2, the p-value for case 7 is near 0.05. The most likely candidate for a pair of outliers is (7, 24), and the associated F-statistic can be computed to be 21.21 on 2 and 10 degrees of freedom. The Bonferroni p-value using either method is near 0.06. Although (7, 24) is evidently an outlier pair, it has little influence on the least squares estimate of $\beta$ or $\psi$, $D_{(7,24)} = 0.455$, and removal of (7, 24) will move $\hat{\beta}$ only to the edge of a 10% confidence ellipsoid.
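The single-case statistics and Bonferroni p-values quoted above can be checked approximately from the published numbers alone. The sketch below is not the authors' computation. It assumes $n = 23$ cases and $p = 11$ coefficients for the full model without case 2, takes $t_7 = -2.643$ and $v_{77} = 0.560$ from Table 5, uses the standard relations $D_i = (t_i^2/p)\, v_{ii}/(1 - v_{ii})$ and $F_i = t_i^2 (n - p - 1)/(n - p - t_i^2)$, and reads "equal allocation" as multiplying each tail area by the number of candidate cases (23) or pairs (253). The function names are illustrative only.

```python
import math

# Assumptions inferred from the text: 23 cases remain after deleting case 2,
# and model (7.1) has 11 coefficients.
N_CASES = 23
N_PARAMS = 11


def cooks_distance(t, v, p=N_PARAMS):
    """Cook's distance from a Studentized residual t and a leverage v."""
    return (t * t / p) * (v / (1.0 - v))


def outlier_f_root(t, n=N_CASES, p=N_PARAMS):
    """Square root of the single-case outlier F statistic (1 and n-p-1 df)."""
    return abs(t) * math.sqrt((n - p - 1) / (n - p - t * t))


def t_two_sided_p(t, df, steps=2000):
    """Two-sided tail area of Student's t via Simpson's rule on [0, |t|]."""
    const = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))

    def pdf(x):
        return const * (1 + x * x / df) ** (-(df + 1) / 2)

    upper = abs(t)
    h = upper / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return 1.0 - 2.0 * (total * h / 3.0)


def f_tail_df1_2(f, df2):
    """Exact upper tail area of an F variable with 2 and df2 degrees of freedom."""
    return (1.0 + 2.0 * f / df2) ** (-df2 / 2.0)


# Case 7, using the Table 5 entries t_7 = -2.643 and v_77 = 0.560.
d7 = cooks_distance(-2.643, 0.560)
f7 = outlier_f_root(-2.643)

# Bonferroni with equal allocation: scale by the number of candidates.
p_case7 = min(1.0, N_CASES * t_two_sided_p(3.916, N_CASES - N_PARAMS - 1))
p_pair = min(1.0, math.comb(N_CASES, 2) * f_tail_df1_2(21.21, 10))

print(round(d7, 3), round(f7, 3), round(p_case7, 3), round(p_pair, 3))
```

Under these assumptions the computed values agree with the Table 5 entries for case 7 to within rounding, and both adjusted p-values land close to the 0.05 and 0.06 quoted in the text. The allocation method of Section 2 is not reproduced here.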
As an alternative to deleting cases 7 and 24 (that is, modeling them with indicator vectors), we could consider attempting to expand the model to include additional terms in the base carriers. For example, if a variable $(S - N_e)^2$ is added to the model, then the residuals for both cases 7 and 24 become relatively small. However, the influence measure for these two cases on the parameter estimate for $\psi^* = \{(S - N_e)^2\}$ is $D_{(7,24)}(\psi^*) = 19.71$, which suggests that including this variable has little effect other than providing an alternative model for cases 7 and 24. While in this problem we prefer to delete these cases, in other problems adding another variable may be preferable.

TABLE 4-Estimated coefficients, standard errors and root mean squared error (RMSE) for selected data sets and models.

                                      Cases deleted
Coefficient        (2)              (2,7,24)         (2,7,24)         (2,7,24)         (7,24)
β0                 -0.291 (0.498)    0.417 (0.400)    0.492 (0.129)    0.436 (0.145)    0.400 (0.142)
β1  (A)             2.244 (0.819)    1.426 (0.510)    1.294 (0.190)    1.381 (0.216)    1.458 (0.206)
β2  (T)            -0.009 (0.003)   -0.006 (0.002)   -0.007 (0.001)   -0.006 (0.001)   -0.006 (0.001)
β3  (S-Ne)          0.136 (0.114)    0.006 (0.085)    *                *                *
β4  (C)             0.025 (0.028)    0.030 (0.015)    0.022 (0.010)    0.028 (0.012)    0.030 (0.012)
β5  (LP)            0.436 (0.266)    0.341 (0.146)    0.399 (0.083)    0.379 (0.087)    0.357 (0.085)
β6  (E)             0.573 (0.261)    0.265 (0.135)    0.301 (0.074)    0.295 (0.074)    0.293 (0.075)
β13 (A x (S-Ne))   -0.465 (0.178)   -0.333 (0.107)   -0.326 (0.052)   -0.319 (0.053)   -0.309 (0.053)
β14 (A x C)        -0.011 (0.057)   -0.023 (0.028)    *               -0.021 (0.024)   -0.045 (0.012)
β15 (A x LP)       -0.049 (0.443)    0.073 (0.224)    *                *                *
β16 (A x E)        -0.291 (0.354)    0.050 (0.178)    *                *                *
RMSE (σ̂)            0.291            0.139            0.122            0.123            0.124

Each entry is an estimated coefficient followed by its estimated standard error in parentheses.
* Term omitted from computations.
FIGURE 2. Studentized residual plot, case 2 deleted, full model. (Axes: $t_i$ against $Y_i$.)

The most influential pair of cases is (3, 20), $D_{(3,20)} = \infty$; when these cases are removed the model becomes rank deficient and a unique least squares estimate of $\beta$ does not exist. The deficiency arises because $A$ and $E \times A$ are identical after cases 3 and 20 are removed. Although these cases are jointly influential, they are individually uninfluential and there is no apparent reason to doubt their authenticity. No other pair has a serious influence on $\beta$ or $\psi$.

At this point we choose to delete the pair of suspected outliers, (7, 24). The second column of Table 4 gives the estimated coefficients for the full model without (2, 7, 24). The univariate case statistics for the full model without (2, 7, 24) are summarized in Figures 3, 4 and 5. The residual plot is well-behaved and the univariate case statistics reveal no problems or anomalies. Inspection of the pairwise case statistics reveals no joint outliers and, of course, (3, 20) is still the most influential pair. However, in addition, there is a second pair which is highly influential, $D_{(4,16)} = 7.1$ and $D_{(4,16)}(\psi) = 9.0$. Individually these cases are uninfluential. The high joint influence appears to be the result of a large residual correlation, +0.89. Since the residual correlation is positive, cases 4 and 16 probably lie on opposite sides of the center of the data, nearly on the same ray (Cook, 1979).

TABLE 5-Univariate case statistics for the full model with case 2 deleted.

Case (i)    r_i      t_i    F_i^(1/2)   v_ii    D_i    D_i(ψ)
    1      -.072    -.383     .369      .580    .018    .008
    3      -.124    -.714     .699      .646    .085    .052
    4       .211     .872     .863      .307    .031    .018
    5      -.258   -1.366    1.423      .578    .232    .206
    6       .158    1.190    1.213      .793    .492    .357
    7      -.510   -2.643    3.916      .560    .810    .737
    8       .343    1.322    1.370      .208    .042    .050
    9       .138     .604     .587      .386    .021    .009
   10      -.204    -.889     .880      .379    .044    .052
   11       .110     .468     .452      .354    .011    .010
   12      -.090    -.352     .338      .234    .003    .004
   13       .121     .492     .475      .286    .009    .008
   14       .039     .154     .148      .256    .001    .001
   15       .094     .497     .481      .583    .031    .050
   16       .079     .339     .326      .365    .006    .004
   17      -.086    -.622     .606      .777    .123    .144
   18       .081     .713     .698      .848    .257    .363
   19       .010     .042     .040      .261    .000    .000
   20       .124     .714     .699      .646    .085    .099
   21       .109     .543     .527      .529    .030    .043
   22       .149     .614     .597      .301    .015    .014
   23       .121     .720     .704      .669    .095    .116
   24      -.541   -2.519    3.512      .455    .482    .380

Here ψ = (β1, β13, β14, β15, β16).

FIGURE 3. Studentized residual plot, cases 2, 7, 24 deleted, full model.

While removing these cases may lead to different conclusions, no real need or justification for doubting the usefulness of these two cases is apparent. Our strategy is to leave them in for further analyses, and continue to monitor their influence.

Final Model (Cases 2, 7 and 24 Deleted)

The final model was selected by calculating all possible subset regressions (Furnival and Wilson, 1974) and examining the few with smallest Mallows' $C_p$. Two final models were chosen. The third column of Table 4 gives the estimated coefficients for the model we consider "best" and the fourth column gives those for a possible second choice. The two models differ by the presence of the $A \times C$ ($\beta_{14}$) term. The estimates of the coefficients in common to the two models are quite close. Also, $|\hat{\beta}_{14}|$ is less than its standard error and might be judged unnecessary.

The univariate case statistics and residual plots for the "best" model all appear well behaved. The bivariate case statistics were also inspected and again no problems were noted. In particular, (3, 20) and (4, 16) are no longer influential.
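The selection criterion is not restated in this section. For reference, a common textbook form of Mallows' statistic is given below. That this is the exact variant used for the subset search is an assumption, as are the symbols: $RSS_q$ is the residual sum of squares of a candidate model with $q$ coefficients, $\hat{\sigma}^2$ is the residual mean square of the full model (7.1), and $n$ is the number of cases retained.

$$C_q = \frac{RSS_q}{\hat{\sigma}^2} - (n - 2q)$$

In this form the full model always has $C_q = q$, so candidate subsets are usually judged by how small $C_q$ is and how close it lies to $q$.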
As a check on the influence of (3, 20) and (4, 16), the best 5 models using the $C_p$ criterion were computed without various combinations of these points; our final model was always among the best 5. Evidently, these pairs of cases have little influence on the terms in the final solution.

FIGURE 4. Index plot of $v_{ii}$, full model, cases 2, 7, 24 deleted.

FIGURE 5. Index plot of $D_i$, full model, cases 2, 7, 24 deleted.

In the final model, the estimated predicted difference $\Delta LY$ (see 7.2) contains the seeding effect and only one of the four possible interaction terms,

$$\Delta LY = 1.29 - 0.33 (S - N_e). \qquad (7.3)$$

The coefficient of the seeding suitability criterion, $S - N_e$, is negative and the predicted difference in rainfall decreases as $S - N_e$ increases. According to this result, seeding produces a decrease in rainfall when $S - N_e > 3.91$. In short, contrary to the experimenters' prior opinion, there is evidence to suggest that optimal seeding occurs if the seeding suitability criterion is low!

Case 2

Recall that case 2 was deleted at the outset on the grounds that it would probably be influential and that it may not conform to the process under study. The F-statistic for testing the fit of case 2 to the final model of Section 7.3 has the value 16.00 with 1 and 16 degrees of freedom and the associated p-value is less than 0.001. However, it is possible that case 2 can be explained by one of the deleted terms. To check on this possibility, the previous analysis was repeated with case 2 included.

All qualitative conclusions reached in the analysis without case 2 remain valid with case 2. Also, as expected, case 2 is highly influential. For example, in the full model after deleting the outlier pair (7, 24) the distance measure for case 2 is $D_2 = 3.25$.
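As a reference point for comparing the final models with and without case 2, the break-even value quoted after (7.3) can be recovered by setting the predicted difference to zero. The second line is an added interpretation rather than a statement from the paper: because $LY$ is on a $\log_{10}$ scale, the predicted difference corresponds to a ratio of predicted rainfall on seeded days to that on unseeded days.

$$\Delta LY = 0 \iff 1.29 - 0.33 (S - N_e) = 0 \iff S - N_e = \frac{1.29}{0.33} \approx 3.91$$

$$\frac{Y(A = 1)}{Y(A = 0)} = 10^{\Delta LY}$$

For values of $S - N_e$ above about 3.91 the exponent is negative, so the ratio falls below 1.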
The primary difference between the two analyses is in the final models. The last column of Table 4 gives the estimated coefficients for the model judged best using $C_p$ when case 2 is present. A comparison of the last three columns of Table 4 suggests that case 2 is influential for only the $A \times C$ term. This is confirmed by the distance measures for the subsets $(\beta_{14})$ and $\psi^T = (\beta_0, \beta_1, \beta_2, \beta_4, \beta_5, \beta_6, \beta_{13})$ from the final model, $D_2(\beta_{14}) = 4.15$ and $D_2(\psi) = 0.57$. The $A \times C$ term is needed to model case 2 only. The predicted difference from the final model with case 2 is

$$\Delta LY = 1.46 - 0.31 (S - N_e) - 0.045 C. \qquad (7.4)$$

This suggests, in addition to previous conclusions, that the effect of seeding decreases with increasing cloud cover. Admittedly, this conclusion seems odd.

Finally, the full model with case 2 was fit using Huber's proposal 2 (1973) robust estimator with a variety of truncation points. The scale was chosen as median$[|r_i|/0.6745]$ and Bickel's proposal 2 (1975) was used as the stepping method with Andrews' median estimate (1974) as a starting value. After 15 iterations with truncation point 1.0, cases 7, 8, 15 and 24 are given weights less than 1 (when the robust method of fitting is viewed as an iteratively weighted least squares method). These four cases have relatively large residuals, and include the two presumed outlying cases. Case 2, however, is given full weight in this robust fit, and it should be no surprise that the estimated change in rainfall closely approximates (7.4) rather than (7.3). When case 2 is deleted, the result then more closely resembles (7.3). Thus, robust methods that are relatively insensitive to outliers may still depend on cases that are influential because of high leverage.

8. DISCUSSION AND SUMMARY

Cases are termed influential if important features of the analysis are altered substantially when they are deleted. Individual or groups of cases can be influential and yet go undetected during the usual analysis of residuals. Thus, it is necessary to consider additional methodologies for isolating such cases.

The empirical influence functions for $\hat{\beta}$ and $\hat{Y}$ as defined here measure the influence that specified cases have on an analysis. However, because these functions are multidimensional, it will usually be necessary to consider lower dimensional characterizations. The characterizations and volume ratios presented here can effectively isolate influential cases and are aids to understanding the underlying causes. In particular, $D_I$ is a useful omnibus measure of influence.

Once a subset of influential cases has been isolated, it is desirable to investigate the causes by considering the components of $D_I$, $h_l^2$ and $\lambda_l$ (cf. (3.8)). A useful summary of these components is provided by $\sum h_l^2$ and $\sum \lambda_l/(1 - \lambda_l)$, although other effective summaries are possible. In addition to considering the components of $D_I$, it may be useful to investigate further the effects of the influential cases by considering components of $\beta$ using $D_I(\psi)$.

Cases can be influential because they correspond to outlying responses, remote points in the factor space or, perhaps, a combination of the two. The judgment that a case is influential does not necessarily imply that it should be deleted or down weighted, although this may be an attractive option if the corresponding Studentized residual is large.

If a case is influential because it is remote in the factor space, then it could be the most important case in the data since it may provide the only information in a region where the ability to take observations is limited. Alternatively, such a case might be deleted if it is believed that the model fit to the bulk of the data is not appropriate in a neighborhood of the case in question. Generally, decisions regarding such cases
may be difficult since there will be relatively little internal evidence for assessing their validity. The decision to retain them may necessarily be based on faith alone if external evidence of their authenticity is lacking. For an extreme example, in the FACE data, it is not possible using (3.9) to test (3, 20) as an outlying pair since removing them results in a rank deficient model.

9. ACKNOWLEDGMENTS

This work was supported by grant 1-R01-GM25587 from the National Institute of General Medical Science, NIH. We are grateful to G. W. Stewart for several enlightening discussions concerning computational problems relevant to this paper, and to the referees for many helpful comments.

REFERENCES

ANDREWS, D. F. (1974). A robust method for multiple linear regression. Technometrics, 16, 523-31.
ANDREWS, D. F. and PREGIBON, D. (1978). Finding outliers that matter. J. Roy. Statist. Soc. B, 40, 85-93.
BELSLEY, D. A., KUH, E. and WELSCH, R. E. (1980). Regression Diagnostics. New York: Wiley.
BINGHAM, C. (1977). Some identities useful in the analysis of residuals from linear regression. Technical Report No. 300, School of Statistics, University of Minnesota, St. Paul, MN 55108.
BICKEL, P. (1975). One step Huber estimates in the linear model. J. Amer. Statist. Assoc., 70, 428-434.
BRADLEY, R. A., SRIVASTAVA, S. S. and LANZDORF, A. (1979). Some approaches to statistical analysis of a weather modification experiment. Technical Report M490, Department of Statistics, Florida State University, Tallahassee, FL 32306.
COOK, R. D. (1977). Detection of influential observations in linear regression. Technometrics, 19, 15-18.
COOK, R. D. (1979). Influential observations in linear regression. J. Amer. Statist. Assoc., 74, 169-174.
COOK, R. D. and HOLSCHUH, N. (1979). Comment on field experimentation in weather modification. J. Amer. Statist. Assoc., 74, 68-70.
DONGARRA, J., BUNCH, J. R., MOLER, C. B. and STEWART, G. W. (1979). The LINPACK User's Guide. Philadelphia: SIAM.
DRAPER, N. and JOHN, J. A. (1979).
Influential observations and outliers in regression. Technical Report 581, Department of Statistics, University of Wisconsin; to appear in Technometrics.
FURNIVAL, G. and WILSON, R. (1974). Regression by leaps and bounds. Technometrics, 16, 499-511.
GENTLEMAN, J. F. and WILK, M. B. (1975). Detecting outliers II: Supplementing the direct analysis of residuals. Biometrics, 31, 387-410.
HOAGLIN, D. C. and WELSCH, R. (1978). The hat matrix in regression and ANOVA. American Statistician, 32, 17-22.
HUBER, P. (1973). Robust regression: Asymptotics, conjectures and Monte Carlo. Ann. Statist., 1, 799-821.
HUBER, P. (1975). Robustness and designs. In A Survey of Statistical Design and Linear Models, ed. by J. N. Srivastava, pp. 287-302. Amsterdam: North Holland.
JAECKEL, L. A. (1972). The infinitesimal jackknife. Bell Laboratories Memorandum; Murray Hill, N.J. 07974.
MALLOWS, C. L. (1975). Some topics in robustness. Bell Laboratories Memorandum; Murray Hill, N.J. 07974.
STEWART, G. W. (1974). Introduction to Matrix Computation. New York: Academic Press.
SRIKANTAN, K. S. (1961). Testing for a single outlier in a regression model. Sankhya A, 23, 251-260.
WEISBERG, S. (1980). Applied Linear Regression. New York: Wiley.
WELSCH, R. E. and KUH, E. (1977). Linear regression diagnostics. Working paper No. 173, National Bureau of Economic Research, Cambridge, MA.
WOODLEY, W. L., SIMPSON, J., BIONDINO, R. and BERKELEY, J. (1977). Rainfall results 1970-75: Florida area cumulus experiment. Science, 195, (February 25), 735-742.