Document 6535635
OUTLIERS IN SAMPLE SURVEYS

P.D. Ghangurde, Statistics Canada, Ottawa, K1A 0T6, Canada

KEYWORDS: Variance-inflation, outlier robust estimation

1. Introduction

In the literature on regression analysis several approaches for detection and treatment of outliers have been developed. In addition to methods based on the mean-shift and variance-inflation models, estimators based on order statistics, such as trimmed and Winsorized means, and M-estimators based on robust regression methods are available. Regression diagnostics provide methods for critical examination of models and measures of influence of individual outliers and groups of outliers on estimates of parameters (see Beckman and Cook (1983); Cook and Weisberg (1982)).

The objective of this paper is to develop outlier robust estimators for sample surveys, based on the variance-inflation model. This model is a simple extension of the superpopulation model often implicitly assumed for the traditional design-based estimators and more explicitly used in the prediction approach. Although these estimators are obtained as optimal estimators of parameters of this model, the results on the bias and variance of these estimators and on optimal weight reduction for outliers, presented in this paper, are in the framework of finite population sampling. These outlier robust estimators are not model dependent and have not been evaluated by the prediction approach. Outlier robust estimators in finite population sampling based on robust regression and the prediction approach have been investigated by Chambers (1986).

The problem of outliers has been considered in the literature on finite population sampling in the context of estimation of a mean or total, usually assuming no auxiliary information in estimation. Estimators obtained by methods based on order statistics, such as Winsorization and trimming, and by weight reduction have been investigated by assuming simple random sampling (see e.g. Fuller (1970); Ernst (1980)). Methods based on order statistics do not appear to extend to sample designs involving stratification, different sampling ratios and non-response rates between strata, and unequal probabilities of selection, all of which result in unequal design weights. Sample surveys are often periodic, with rotation samples designed for estimation of changes. Moreover, estimates are needed at several levels such as stratum, group of strata and domains. Because of these features of sample surveys, estimators based on reduction of weights of outliers are more convenient for use in practice.

In Section 2 we introduce the model in which variance is a function of an auxiliary variable x and assume that outliers have inflated variances. The optimal estimators of the parameter $\beta$ are derived by assuming $k$ outliers ($1 \le k < n$) in a random sample of size $n$. In Section 3 the conditional mean square error of these estimators and the optimal weight reduction are derived by assuming simple random sampling from a finite population. In the concluding remarks in Section 4, comments are made on possible extensions of outlier robust estimation and on the problem of estimation of unit variances and covariances of x and y for outliers and non-outliers.

2. Variance-inflation Model

Consider a linear model

$$\underset{n\times 1}{Y} = \underset{n\times p}{X}\;\underset{p\times 1}{\beta} + \underset{n\times 1}{e}, \tag{2.1}$$

where $e \sim (0, \sigma^2 W)$, $\sigma^2$ is unknown, $W$ is a diagonal variance-covariance matrix with elements $w_i$ depending on $x_i$, $i = 1, 2, \ldots, n$, $Y$ is an $n$-vector of responses of y, $X$ is a design matrix of $p$ auxiliary variables, each with $n$ observations assumed fixed, $\beta$ is a $p$-vector of regression coefficients and $e$ is an error term. Under the model, assuming $p = 1$ and $w_i = x_i^g$, the ratio of sample means $\bar{y}/\bar{x}$ and the mean of ratios $\bar{r} = \frac{1}{n}\sum_{i=1}^{n} y_i/x_i$ are best linear unbiased estimators of $\beta$ for $g = 1$ and $g = 2$ respectively.
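The statement that the weighted least squares estimator reduces to the ratio of means for g = 1 and to the mean of ratios for g = 2 can be checked numerically. The following is a minimal Python sketch under the p = 1 model with w_i = x_i^g; the function name `blue_slope` and the data values are illustrative assumptions and do not come from the paper.

```python
def blue_slope(x, y, g):
    """Weighted least squares estimate of beta for y_i = beta*x_i + e_i,
    with Var(e_i) proportional to x_i**g (the p = 1 case of model (2.1))."""
    num = sum(xi * yi / xi**g for xi, yi in zip(x, y))   # X'W^{-1}Y
    den = sum(xi * xi / xi**g for xi in x)               # X'W^{-1}X
    return num / den


# Made-up positive data, used only to illustrate the identity.
x = [2.0, 4.0, 5.0, 10.0]
y = [3.0, 9.0, 8.0, 25.0]

ratio_of_means = sum(y) / sum(x)
mean_of_ratios = sum(yi / xi for xi, yi in zip(x, y)) / len(x)

print(blue_slope(x, y, 1), ratio_of_means)   # the two values agree for g = 1
print(blue_slope(x, y, 2), mean_of_ratios)   # the two values agree for g = 2
```

For values of g strictly between 1 and 2 the same function gives an estimator that lies between these two special cases in how heavily it weights units with large x.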
The model is appropriate for categorical variables in socio-economic surveys. It is known that values of $g$ for many variables lie in the interval [1,2], more often closer to 1 than to 2. In practice, in multipurpose sample surveys with several y-variables and an auxiliary variable x possibly used for stratification, ratio estimation, although less than optimal for variables with $g > 1$, is often used for all y-variables for the convenience of a uniform weighting method.

We now consider the variance-inflation model for $k$ outliers ($1 \le k < n$) which, without loss of generality, are the last $k$ sample units. Thus

$$\underset{n\times 1}{Y} = \underset{n\times p}{X}\;\underset{p\times 1}{\beta} + \underset{n\times 1}{e}, \tag{2.2}$$

where $e \sim (0, \sigma^2 W(k))$ and $W(k)$ is a diagonal variance-covariance matrix with elements $w_i$, $i = 1, 2, \ldots, n-k$, and $w_i/w$, $i = n-k+1, \ldots, n$; $w$ is an unknown constant ($0 < w \le 1$). This model is a simple extension of the variance-inflation model considered by Pregibon (1981), Cook, Holschuh and Weisberg (1982) and Thompson (1985).

We consider the expression for the best linear unbiased estimator $\hat{\beta}_i(w)$ of $\beta$ under (2.2) for the case of one outlier, the $i$th sample unit. Thus

$$\hat{\beta}_i(w) = \hat{\beta} - \frac{(X'W^{-1}X)^{-1} X_i\,(y_i - \hat{y}_i)(1-w)}{w_i\,[1 - (1-w)V_{ii}]}, \tag{2.3}$$

where for $p = 1$, $(X'W^{-1}X)^{-1}$ and $X_i$ are scalars, $\hat{\beta} = (X'W^{-1}X)^{-1}(X'W^{-1}Y)$ is the estimator of $\beta$ under (2.1) and $V_{ii} = w_i^{-1} X_i (X'W^{-1}X)^{-1} X_i'$ is the $i$th diagonal element of the variance-covariance matrix $V = V(X\hat{\beta})$, called the leverage of $X_i$. Also, $0 \le V_{ii} \le 1$ when $p = 1$. For large values of $X_i$, $V_{ii}$ is close to 1, which makes the contribution of unit $i$ to $\hat{\beta}_i(w)$ very large. The second term on the right hand side of (2.3) shows the change due to variance-inflation of the $i$th sample unit. Thus the influence of both the residual $(y_i - \hat{y}_i)$ and the leverage $V_{ii}$ is reduced due to the factor $(1-w)$.
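Expression (2.3) can be checked against a direct weighted least squares fit in which the variance of the ith unit is inflated by the factor 1/w. The sketch below does this for p = 1 in Python; the function names `wls_slope` and `slope_by_2_3` and the data are hypothetical and chosen only for illustration.

```python
def wls_slope(x, y, var):
    """BLUE of beta for p = 1 when Var(e_i) is proportional to var[i]."""
    num = sum(xi * yi / vi for xi, yi, vi in zip(x, y, var))
    den = sum(xi * xi / vi for xi, vi in zip(x, var))
    return num / den


def slope_by_2_3(x, y, wts, i, w):
    """Right-hand side of (2.3) for p = 1 and a single outlier at index i."""
    a_inv = 1.0 / sum(xj * xj / wj for xj, wj in zip(x, wts))  # (X'W^{-1}X)^{-1}
    beta_hat = wls_slope(x, y, wts)              # estimator under (2.1)
    residual = y[i] - beta_hat * x[i]            # y_i - yhat_i
    leverage = x[i] * a_inv * x[i] / wts[i]      # V_ii
    correction = a_inv * x[i] * residual * (1.0 - w) / (
        wts[i] * (1.0 - (1.0 - w) * leverage))
    return beta_hat - correction


# Made-up data: the last unit has a large residual.
x = [2.0, 4.0, 5.0, 10.0]
y = [3.0, 9.0, 8.0, 60.0]
wts = list(x)              # w_i = x_i, i.e. g = 1
i, w = 3, 0.25

inflated = list(wts)
inflated[i] = wts[i] / w   # variance of unit i inflated by 1/w, as in (2.2)
direct = wls_slope(x, y, inflated)
via_formula = slope_by_2_3(x, y, wts, i, w)
print(direct, via_formula)  # the two computations agree
```

Setting w = 1 in `slope_by_2_3` makes the correction term vanish, so the estimator under (2.1) is recovered.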
For $k$ ($> 1$) outliers and $w_i = x_i$, the estimator $\hat{\beta}_{(i)}(w)$, where $(i)$ represents the group of $k$ outliers, can also be given in a form which shows the weight reduction of outliers:

$$\hat{R} = \frac{wk\,\bar{y}_k + (n-k)\,\bar{y}_{n-k}}{wk\,\bar{x}_k + (n-k)\,\bar{x}_{n-k}}. \tag{2.4}$$

When $w_i = x_i^2$, $\hat{\beta}_{(i)}(w)$ is given by

$$\hat{\bar{r}} = \frac{wk\,\bar{r}_k + (n-k)\,\bar{r}_{n-k}}{wk + (n-k)}, \tag{2.5}$$

and when $x_i = 1$ for $i = 1, 2, \ldots, N$, $\hat{\beta}_{(i)}(w)$ is given by

$$\hat{\bar{y}} = \frac{wk\,\bar{y}_k + (n-k)\,\bar{y}_{n-k}}{wk + (n-k)}, \tag{2.6}$$

where $\bar{y}_k$ and $\bar{x}_k$ are sample means of outliers, $\bar{y}_{n-k}$ and $\bar{x}_{n-k}$ are sample means of non-outliers, and $\bar{r}_k$ and $\bar{r}_{n-k}$ are sample means of ratios $r_i = y_i/x_i$ for outliers and non-outliers respectively. When $w \to 0$, in the limit these estimators are $\hat{R} = \bar{y}_{n-k}/\bar{x}_{n-k}$, $\hat{\bar{r}} = \bar{r}_{n-k}$ and $\hat{\bar{y}} = \bar{y}_{n-k}$, which can also be obtained as optimal estimators of $\beta$ under the corresponding mean-shift model. The estimator of the population total can be obtained from (2.6) as $N\hat{\bar{y}}$; it shows reduction of the simple random sampling weight $N/n$ of the $k$ outliers by the factor $w$ and a resulting adjustment of the weights of all $n$ units by the factor $n/[n - k(1-w)]$. From (2.5) and (2.6) it can be seen that the results for $\hat{\bar{r}}$ can be obtained from those for $\hat{\bar{y}}$ by substituting $r_i$ for $y_i$ for all $i$.

These estimators are investigated in Section 3 using conditional inference given the number of outliers $k$, assuming simple random sampling from a finite population. Although it may be possible to use the prediction approach and to obtain model dependent estimators incorporating design weights, the results in this paper have been obtained in the framework of finite population sampling. For detection of outliers in the case of ratio estimation and the sample mean, Cook's distance seems to have good potential (Cook and Weisberg (1982)).

3. Outlier Robust Estimation

We assume that a finite population of size $N$ contains an unknown proportion $P$ of outliers. The population mean $\bar{X}$ of an auxiliary variable x is assumed known. A simple random sample of size $n$ is drawn without replacement from the population and outliers are identified on the basis of values of some test statistic. Sample units may be identified as outliers if the $i$th unit values $(x_i, y_i)$ lie in some region of the sample space determined by the test. The test could be based on diagnostics such as Cook's distance (see Ghangurde (1989)).

The outlier robust ratio estimator of the population mean of the y's is given by $\hat{\bar{y}}_R = \hat{R}\bar{X}$, where

$$\hat{R} = \frac{kw\,\bar{y}_k + (n-k)\,\bar{y}_{n-k}}{kw\,\bar{x}_k + (n-k)\,\bar{x}_{n-k}} \tag{3.1}$$

is an estimator of the population ratio $R = \bar{Y}/\bar{X}$ of the population means $\bar{Y}$ and $\bar{X}$. The ratio $R$ can be expressed as

$$R = \frac{P\bar{Y}_1 + (1-P)\bar{Y}_2}{P\bar{X}_1 + (1-P)\bar{X}_2}, \tag{3.2}$$

where $\bar{X}_1$ and $\bar{Y}_1$ are unknown population means of outliers, $\bar{X}_2$ and $\bar{Y}_2$ are unknown population means of non-outliers, and $P$ is the unknown proportion of outliers in the finite population of $(x, y)$. We also assume that $\sigma^2_{1x}$, $\sigma^2_{1y}$ are unit variances of outliers, $\sigma^2_{2x}$, $\sigma^2_{2y}$ are unit variances of non-outliers, $\sigma^2_{1x} > \sigma^2_{2x}$ and $\sigma^2_{1y} > \sigma^2_{2y}$. Also, $\sigma_{1xy}$ and $\sigma_{2xy}$ are unit covariances of outliers and non-outliers respectively. By substituting $\bar{Y}_1 = \delta_1\bar{Y}_2$ and $\bar{X}_1 = \delta_2\bar{X}_2$ we have

$$R = \frac{[1 + P(\delta_1 - 1)]}{[1 + P(\delta_2 - 1)]}\,\frac{\bar{Y}_2}{\bar{X}_2}. \tag{3.3}$$

Since in general $\delta_1 \ne \delta_2$, $R \ne \bar{Y}_2/\bar{X}_2$. We obtain the conditional bias $B(\hat{R} \mid k)$ by the Taylor series linearization method as

$$B(\hat{R} \mid k) \approx \frac{kw\bar{Y}_1 + (n-k)\bar{Y}_2}{kw\bar{X}_1 + (n-k)\bar{X}_2} - R = \frac{(\delta_2 - \delta_1)}{[1 - P(1-\delta_2)]}\left\{\frac{P(n-k) - kw(1-P)}{n - k(1 - w\delta_2)}\right\}\left(\frac{\bar{Y}_2}{\bar{X}_2}\right). \tag{3.4}$$

When $w = 0$, $\hat{R} = \bar{y}_{n-k}/\bar{x}_{n-k}$ and

$$B(\hat{R} \mid k) = \frac{(\delta_2 - \delta_1)\,P}{(1 - P(1-\delta_2))}\left(\frac{\bar{Y}_2}{\bar{X}_2}\right).$$

When $w = 1$, $\hat{R} = \bar{y}_n/\bar{x}_n$ with no weight reduction and

$$B(\hat{R} \mid k) = \frac{(\delta_2 - \delta_1)}{(1 - P(1-\delta_2))}\left\{\frac{nP - k}{n - k(1-\delta_2)}\right\}\frac{\bar{Y}_2}{\bar{X}_2}.$$

It can be seen that

$$f(w) = \frac{P(n-k) - kw(1-P)}{n - k(1 - w\delta_2)}$$

is a monotonic decreasing function of $w$ in $(0,1)$ and $f(w) \gtrless 0$ according as $P \gtrless kw/[n - k(1-w)]$. Hence, if $P > k/n$, the conditional bias of $\hat{R}$ for $w = 0$ is in absolute value greater than that of $\hat{R}$ for $w$ in $(0,1)$. If $P < k/n$, the absolute conditional bias increases as $w$ increases in $\left[\frac{P(n-k)}{(1-P)k}, 1\right]$ and may not be smaller in absolute value than that for $w = 0$.

This analysis of the conditional bias of $\hat{R}$ is only of theoretical interest; $P$ is not known in practice. By linearization of (3.4) and taking expectation over $k$, the unconditional bias of $\hat{R}$ is given by

$$B(\hat{R}) = \frac{P(\delta_2 - \delta_1)}{1 + P(\delta_2 - 1)}\left(\frac{\bar{Y}_2}{\bar{X}_2}\right)(1-P)\left[(1+P) - w(1 + P + P\delta_2) + w^2 P\delta_2\right] + O\!\left(\frac{1}{n}\right). \tag{3.5}$$

When $w = 0$,

$$B(\hat{R}) = \frac{P(\delta_2 - \delta_1)(1 - P^2)}{(1 + P(\delta_2 - 1))}\left(\frac{\bar{Y}_2}{\bar{X}_2}\right) \gtrless 0 \quad \text{according as } \delta_2 \gtrless \delta_1.$$

Also, when $w = 1$, $B(\hat{R}) = 0$ to $O(1/n)$. The results for the outlier robust sample mean $\hat{\bar{y}}$ given in (2.6) can be obtained from the above results by substituting $x_i = 1$ for $i = 1, 2, \ldots, N$, and $\bar{X}_1 = \bar{X}_2 = \delta_2 = 1$.

By the Taylor series linearization method

$$V(\hat{R} \mid k) \approx \left(\frac{\tilde{R}kw}{\tilde{X}}\right)^2 V(\bar{x}_k \mid k) + \left(\frac{(n-k)\tilde{R}}{\tilde{X}}\right)^2 V(\bar{x}_{n-k} \mid k) + \left(\frac{kw}{\tilde{X}}\right)^2 V(\bar{y}_k \mid k) + \left(\frac{n-k}{\tilde{X}}\right)^2 V(\bar{y}_{n-k} \mid k) - \frac{2\tilde{R}k^2w^2}{\tilde{X}^2}\,\mathrm{Cov}(\bar{x}_k, \bar{y}_k \mid k) - \frac{2\tilde{R}(n-k)^2}{\tilde{X}^2}\,\mathrm{Cov}(\bar{x}_{n-k}, \bar{y}_{n-k} \mid k), \tag{3.6}$$

where $\tilde{X} = [kw\bar{X}_1 + (n-k)\bar{X}_2]$, $\tilde{Y} = [kw\bar{Y}_1 + (n-k)\bar{Y}_2]$ and $\tilde{R} = \tilde{Y}/\tilde{X}$.

In order to obtain an optimal value of $w$, the conditional mean square error of $\hat{R}$ given $k$ can be minimized as a function of $w$. However, if the conditional bias ratio is small, an optimum $w$ obtained by minimizing the conditional variance of $\hat{R}$ can be used. By equating the derivative of the conditional variance with respect to $w$ to zero we have an equation of third degree in $w$ involving several parameters. An analytical solution cannot be obtained unless estimates are substituted for the several parameters involved. An alternative is to minimize the limiting value of $V(\hat{R} \mid k)$ for infinite $N$, which is the same as minimizing the superpopulation variance of $\hat{R}$ under the variance-inflation model with $\mu_x$ and $\mu_y$, the means of x and y respectively. It may be noted that under the model the bias of $\hat{R}$ is zero. The optimum $w$ is given by

$$w = \frac{\sigma^2_{2x}\mu_y^2 + \sigma^2_{2y}\mu_x^2 - 2\sigma_{2xy}\mu_x\mu_y}{\sigma^2_{1x}\mu_y^2 + \sigma^2_{1y}\mu_x^2 - 2\sigma_{1xy}\mu_x\mu_y}. \tag{3.7}$$

This expression for the optimum $w$, obtained by assuming infinite $N$, is simple and needs estimation of fewer parameters. The means $\mu_x$ and $\mu_y$ can be estimated by the sample means $\bar{x}_n$ and $\bar{y}_n$ respectively. Although $\sigma^2_{1x}$, $\sigma^2_{1y}$ and $\sigma_{1xy}$ can be estimated from sample outliers, more efficient estimation can be done by an extension of the Minimum Norm Quadratic Unbiased Estimation (MINQUE) method (see Rao (1970)) to estimation of heteroscedastic variances and covariances when x and y are random.

The optimum $w$, although obtained under the variance-inflation model, can still be used in finite population sampling, and the mean square error can be estimated for values of $w$ close to the optimum given by (3.7) to decide on the choice of another value of $w$. However, due to instability of estimates of bias, it may be preferable to use $w$ given in (3.7).

In the case of the sample mean $\hat{\bar{y}}$ the optimum $w_o$ can be obtained by minimizing the mean square error and is given by

$$w_o = \frac{\left(1 - \dfrac{n-k}{N(1-P)}\right)\sigma^2_{2y} + (n-k)\,P\,(\delta_1 - 1)^2\,\bar{Y}_2^2}{\left(1 - \dfrac{k}{NP}\right)\sigma^2_{1y} + k\,(1-P)\,(\delta_1 - 1)^2\,\bar{Y}_2^2}. \tag{3.8}$$

In the literature several estimators obtained by weight reduction, which can be considered as linear combinations of $\bar{y}_k$ and $\bar{y}_{n-k}$, have been investigated (see e.g. Hidiroglou and Srinath (1981)). Their optimal estimator

$$\bar{y}_r = \frac{rk}{N}\,\bar{y}_k + \left(1 - \frac{rk}{N}\right)\bar{y}_{n-k}$$

with $r = r_o$ obtained by minimizing the mean square error of $\bar{y}_r$ is the same as $\hat{\bar{y}}$ with optimal $w = w_o$, since $kw_o/((n-k) + kw_o) = r_o k/N$. The important advantage of the estimators suggested above is that the weight reduction can be estimated from survey data and extensions to other sample designs seem possible. Thus $w_o$ can be estimated by

$$\hat{w}_o = \frac{\left(1 - \dfrac{n}{N}\right)\hat{\sigma}^2_{2y} + k\left(1 - \dfrac{k}{n}\right)(\bar{y}_k - \bar{y}_{n-k})^2}{\left(1 - \dfrac{n}{N}\right)\hat{\sigma}^2_{1y} + k\left(1 - \dfrac{k}{n}\right)(\bar{y}_k - \bar{y}_{n-k})^2}. \tag{3.9}$$

Although it is possible to obtain an estimate $\hat{\sigma}^2_{1y}$ from sample outliers, the estimator is unstable for small values of $k$. Alternative methods for estimation of heteroscedastic variances have been proposed in the literature (see Kleffe (1977)). The MINQUE was developed for $p \ge 1$ and $n$ unequal variances (see Rao (1970)). Assuming that the last $k$ units are outliers, the MINQUE estimators of unit variances are given by

$$\hat{\sigma}^2_{1y} = \frac{n\sum_{i=n-k+1}^{n}(y_i - \bar{y}_n)^2}{k(n-2)} - \frac{\sum_{i=1}^{n}(y_i - \bar{y}_n)^2}{(n-1)(n-2)}, \qquad \hat{\sigma}^2_{2y} = \frac{n\sum_{i=1}^{n-k}(y_i - \bar{y}_n)^2}{(n-k)(n-2)} - \frac{\sum_{i=1}^{n}(y_i - \bar{y}_n)^2}{(n-1)(n-2)}. \tag{3.10}$$

These estimators are more efficient than the usual estimators of $\sigma^2_{1y}$ and $\sigma^2_{2y}$ and also give more efficient estimators of $\beta$ as compared to those obtained by weighted least squares (Rao and Subrahmaniam (1971)).

4. Concluding Remarks

The variance-inflation model seems to be an appropriate model for outliers in sample surveys. In the optimal estimator based on the model, the influence of outliers with large residuals and high leverage is reduced due to the factor $(1-w)$. Under appropriate assumptions about the dependence of variance on x, it reduces to the outlier robust estimators $\hat{R}$, $\hat{\bar{r}}$ and $\hat{\bar{y}}$, in which the weights of outliers are reduced. In the case of the ratio estimator, although the problem of determination of the optimal weight was simplified by assuming an infinite population, the resulting weight could still be used in the case of finite population sampling. Although it may be possible to use the prediction approach and to obtain model dependent estimators incorporating design weights, conditional inference in the finite population sampling framework shows that these outlier robust estimators have desirable properties. Extension of these estimators to stratified sampling incorporating design weights is being investigated.

Implicit in the variance-inflation model for outliers is the assumption that the superpopulation is a mixture of distributions with the same mean but different variances. In the case of sampling from mixture distributions belonging to the exponential family, maximum likelihood estimators of parameters have been obtained and sufficient conditions have been established for outliers to occur in samples from these distributions (see e.g. Gather and Kale (1988)). It would be of interest to investigate the possibility of establishing similar conditions for the occurrence of outliers in sampling from finite populations which have mixture distributions.

Although unit variances and covariances for outliers can be estimated from outliers in the sample, the estimator could be unstable for small values of $k$. In the case of ratio estimation an extension of the MINQUE method for estimation of heteroscedastic variances and covariances of x and y is needed. The problem of variance estimation was not discussed since it does not involve any new methodology, once $w$ is treated as a component of the weights of outliers.

References

Beckman, R.J. and Cook, R.D. (1983), "Outliers," Technometrics, Vol. 25, No. 2, pp. 119-149.
Chambers, R.L. (1986), "Outlier robust finite population estimation," JASA, Vol. 81, No. 396, pp. 1063-1069.
Cook, R.D., Holschuh, N. and Weisberg, S. (1982), "A note on an alternative outlier model," J.R.S.S., B, 44, No. 3, pp. 370-376.
Cook, R.D. and Weisberg, S. (1982), Residuals and Influence in Regression, Chapman and Hall.
Ernst, L.R. (1980), "Comparisons of estimators of the mean which adjust for large observations," Sankhya, C, 42, pp. 1-16.
Fuller, W.A. (1970), "Simple estimators of the mean of skewed populations," Technical Report, Department of Statistics, Iowa State University.
Gather, U. and Kale, B.K. (1988), "Maximum likelihood estimation in the presence of outliers," Communications in Statistics, Theory and Methods, 17 (11), pp. 3767-3784.
Ghangurde, P.D. (1989), "Outlier robust estimation in finite population sampling," presented at the Statistical Society of Canada meeting, Ottawa.
Hidiroglou, M.A. and Srinath, K.P. (1981), "Some estimators of a population total from simple random samples containing large units," JASA, Vol. 76, No. 375, pp. 690-695.
Kleffe, J. (1977), "Optimal estimation of variance components," Sankhya, Vol. 39, B, pp. 211-244.
Pregibon, D. (1981), "Logistic regression diagnostics," The Annals of Statistics, Vol. 9, No. 4, pp. 705-724.
Rao, C.R. (1970), "Estimation of heteroscedastic variances in linear models," JASA, Vol. 65, No. 329, pp. 161-172.
Rao, J.N.K. and Subrahmaniam, K. (1971), "Combining independent estimators and estimation in linear regression with unequal variances," Biometrics, 27, pp. 971-993.
Thompson, R. (1985), "A note on restricted maximum likelihood with an alternative outlier model," J.R.S.S. (B), 47, No. 1, pp. 53-55.
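For the sample mean, the MINQUE unit variances in (3.10), the estimated weight in (3.9) and the estimator in (2.6) can be chained together. The following Python sketch is one way this might be done, assuming the outliers have already been identified and placed last in the list; the function names, the data and the population size N are hypothetical, and the sketch does not guard against negative variance estimates or an estimated weight outside (0, 1], either of which can occur in small samples.

```python
def minque_unit_variances(y, k):
    """Estimates of the outlier and non-outlier unit variances as in (3.10).
    The last k entries of y are treated as the outliers."""
    n = len(y)
    ybar = sum(y) / n
    sq = [(yi - ybar) ** 2 for yi in y]          # squared deviations from ybar_n
    common = sum(sq) / ((n - 1) * (n - 2))       # term shared by both estimates
    s1 = n * sum(sq[n - k:]) / (k * (n - 2)) - common
    s2 = n * sum(sq[:n - k]) / ((n - k) * (n - 2)) - common
    return s1, s2


def estimated_weight(y, k, N):
    """Estimated weight reduction factor as in (3.9)."""
    n = len(y)
    s1, s2 = minque_unit_variances(y, k)
    ybar_out = sum(y[n - k:]) / k
    ybar_non = sum(y[:n - k]) / (n - k)
    shift = k * (1 - k / n) * (ybar_out - ybar_non) ** 2
    fpc = 1 - n / N                              # finite population correction
    return (fpc * s2 + shift) / (fpc * s1 + shift)


def robust_mean(y, k, w):
    """Outlier robust sample mean (2.6); w*k*ybar_k equals w times the outlier sum."""
    n = len(y)
    return (w * sum(y[n - k:]) + sum(y[:n - k])) / (w * k + (n - k))


# Made-up sample of n = 10 with the k = 2 flagged outliers placed last.
y = [10.0, 12.0, 11.0, 13.0, 9.0, 10.0, 12.0, 11.0, 40.0, 55.0]
k, N = 2, 1000
w_hat = estimated_weight(y, k, N)
print(w_hat, robust_mean(y, k, w_hat))
```

With w = 1 the function `robust_mean` returns the ordinary sample mean, and with w = 0 it returns the mean of the non-outliers, matching the limiting cases described in Section 2.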