Impact of Wayne Fuller’s Contributions to Sample Survey Theory and Practice
J.K. Kim (joint work with J.N.K. Rao, Carleton University, Ottawa, Canada)
Working Group Seminar, September 19, 2011
J.K. Kim (ISU) Wayne Fuller’s contribution to sample survey 9/19/11

Contents
1. Brief bio sketch
2. Some early work
3. Regression estimation
4. Regression analysis
5. Quantiles
6. Two-phase sampling
7. Small area estimation
8. Measurement errors
9. Nonresponse and imputation
10. Rejective sampling
11. Other contributions
12. List of Ph.D. theses directed in the survey sampling area

Brief Bio Sketch
Ph.D. in Agricultural Economics, Iowa State University, 1959
Thesis title: A non-static model of the beef and pork economy
Thesis advisor: Geoffrey Shepherd
Supervised 29 M.S. and 69 Ph.D. theses in sampling, time series analysis and measurement errors.
Former students include at least 10 ASA Fellows and one ASA President. Four of them supervised 15 or more Ph.D. theses.
Three Wiley books, on time series analysis, measurement errors and, more recently, sample survey theory.
Citations: unit root tests: JASA paper 8903 citations, Econometrica paper 5352 citations. Measurement Errors book: 2352 citations.

Some early work on sampling theory
Estimation employing post strata, JASA 1966. The proposed method permits construction of unbiased estimates for populations divided into a large number of small post strata. It is superior to the customary practice of combining two post strata when one contains few sample elements.
Sampling with random strata boundaries, JRSSB 1970. Sampling designs are given that permit unbiased variance estimation, with efficiency approximately equal to that of the one-per-stratum design.
A procedure for selecting non-replacement unequal probability samples, 1971 (unpublished). The unconditional selection probability at each draw equals $p_i = x_i / X$, so that $\pi_i = n p_i$; this permits rotation of the sample.

Regression Estimation
Fuller (1968), unpublished report
Reasons for the popularity of the ratio estimator: computational simplicity, and a regression line that often passes close to the origin, implying little loss of efficiency relative to the regression estimator.
Both estimators use a single weight for all variables and ensure calibration to the known total.
But the regression estimator is location and scale invariant, unlike the ratio estimator.
For domain totals or means, ratio estimation may be very inefficient relative to regression estimation: for example, total acres of corn grown on farms of size less than A.

Regression Estimation (Cont’d)
Calibration estimation
Fuller was aware of calibration and range-restricted weights as early as 1968: M. Husain’s 1968 Master’s thesis, Construction of regression weights for estimation in sample surveys.
Doane Agricultural Services Inc. used regression weights since 1972 for their syndicated market research studies.

Regression Estimation (Cont’d)
Husain’s thesis makes two significant contributions to regression estimation of a mean under SRS.
1. Find weights $\{w_i, i \in s\}$ that minimize $\phi = \sum_{i \in s} w_i^2$ subject to $\sum_{i \in s} w_i = 1$ and $\sum_{i \in s} w_i x_i = \bar{X}$ (calibration constraints: CC). Note: under SRS, the objective function $\phi$ is equivalent to the well-known chi-squared distance measure of Deville and Särndal (1992). Husain went further by imposing range restrictions (RR) $a \le w_i \le b$, $b > a > 0$, and proposing to solve the problem using quadratic programming.
2. Relax the calibration constraint and instead minimize the sum of $\phi$ and a distance measure between $\sum_{i \in s} w_i x_i$ and the population mean $\bar{X}$. This proposal is a forerunner to the more recent work based on ridge regression (Chambers 1986, Rao and Singh 1997, 2009).

Regression Estimation (Cont’d)
E. Huang’s (1978) thesis and Huang and Fuller (ASA Proceedings, 1978)
Iterative procedure: the weights satisfy CC at each iteration but not necessarily RR. A good description of the algorithm is given in Fuller et al. (SM, 1994).
The authors note that “It will not always be possible to construct weights satisfying the specified restrictions in the specified number of iterations. If the sample is such that the restrictions cannot be met, the program outputs the weights of the last iteration”.
The proposed estimator has the same asymptotic variance as the regression estimator.

Regression Analysis
Regression analysis for sample survey, Sankhya 1975
The finite population is treated as a sample from an infinite population. Both the finite population and the infinite population regression coefficient vectors, $B$ and $\beta$, are defined.
By defining a sequence of populations and samples, central limit theorems for the sample regression coefficient vector $b$ are obtained for SRS and stratified two-stage sampling designs. Consistent estimators of the asymptotic variance of $b$ are also given.
If the vector of auxiliary variables $x = (1, x_2, \dots, x_p)'$ is replaced by the vector $z = (1, x_2 - \bar{X}_2, \dots, x_p - \bar{X}_p)'$, then the well-known regression projection estimator $\bar{y}_r = \sum_{i \in s} w_i y_i$ of the mean $\bar{Y}$ is identical to the intercept in the regression of $y$ on $z$. Thus the theory of Fuller (1975) for regression coefficients is applicable to the regression estimator of the mean.

Regression Analysis (Cont’d)
Hidiroglou, Fuller and Hickman (1976).
SUPER CARP.
The authors give a linearization variance estimator which uses the calibration weights instead of the design weights in defining the residuals. This follows from Fuller (1975).
The same variance estimator was proposed later in a well-known paper by Särndal et al. (1989).

Regression Analysis (Cont’d)
Informative sampling: Fuller (2009), Sampling Statistics, Wiley, Sec. 6.3.2
Population model: $y_i = x_i' \beta + e_i$, $E_m(e_i) = 0$, $V_m(e_i) = \sigma^2$, $i \in U$
Survey weighted estimator of $\beta$: $\hat{\beta}_w = \left( \sum_{i \in s} w_i x_i x_i' \right)^{-1} \sum_{i \in s} w_i x_i y_i$
The estimator $\hat{\beta}_w$ is consistent for $\beta$ under informative sampling, but it can be inefficient if the weights $w_i$ vary considerably.
Fuller (2009) suggested replacing $w_i$ by $w_i \Psi_i$ in the expression for $\hat{\beta}_w$, where $E_m(e_i \mid x_i, \Psi_i) = 0$, and then searching for the optimal $\Psi_i$. Pfeffermann and Sverchkov (1999) suggested using $\Psi_i = 1/\tilde{w}_i$, where $\tilde{w}_i$ is an estimator of $E_m(w_i \mid x_i, i \in s)$.

Quantiles
Francisco (1987), Ph.D. thesis, and Francisco and Fuller (1991), Estimation of quantiles with survey data, Annals of Statistics
The well-known Bahadur representation of quantiles for SRS is extended to a more general class of sample designs, and the representation is used to show that weighted sample quantiles for complex samples are normally distributed in the limit.
Confidence intervals based on test inversion are also studied.

Two-phase sampling
Fuller (2003): Estimation for multiphase samples, in the Wiley book edited by Chambers and Skinner.
Domain projection estimator: $x$ is observed in a large first-phase sample $A_1$, and both $x$ and $y$ are observed in a smaller subsample $A_2$.
Fuller (2003) proposed a domain projection estimator for $A_1$, based on predicted $y$-values $\{\hat{y}_i, i \in A_1\}$ obtained from $A_2$.
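As a rough illustration of the projection idea, the sketch below fits a weighted simple linear regression of y on x in the second-phase sample, predicts y for every unit of the first-phase sample, and sums the weighted predictions over the domain. The function names, the single covariate, and the use of first-phase weights in the final sum are assumptions made for illustration; they are not taken from Fuller (2003).

```python
def fit_weighted_line(x, y, w):
    """Weighted least squares fit of y = b0 + b1 * x (single covariate)."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    return b0, b1


def domain_projection_total(x1, w1, in_domain, x2, y2, w2):
    """Hypothetical domain projection estimator of a domain total.

    x1, w1, in_domain: x-values, weights and domain flags for phase 1 (A1).
    x2, y2, w2: x-values, y-values and weights for phase 2 (A2).
    """
    # Step 1: estimate the regression of y on x from the phase 2 sample.
    b0, b1 = fit_weighted_line(x2, y2, w2)
    # Step 2: predict y for every phase 1 unit and sum over the domain.
    return sum(
        wi * (b0 + b1 * xi)
        for xi, wi, d in zip(x1, w1, in_domain)
        if d
    )


# Toy data (made up): six phase 1 units, four of them also in phase 2.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
w1 = [10.0] * 6
in_domain = [True, False, True, False, True, False]
x2 = [1.0, 2.0, 4.0, 6.0]
y2 = [2.0 + 3.0 * x for x in x2]  # exactly linear, so the fit is exact
w2 = [15.0] * 4

print(domain_projection_total(x1, w1, in_domain, x2, y2, w2))
```

Note that the domain sum runs over all phase 1 units in the domain, including those with no observed y, which is where the additional information comes from.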
This estimator can be considerably more efficient than the customary two-phase regression estimator for a domain, which is based on the phase 2 sample, if the regression of y on x is linear.

Two-phase sampling (Cont’d)
Re-stratified two-phase sampling: Kim, Navarro and Fuller, JASA 2006
This paper gives a consistent replication variance estimator that is applicable to both the double expansion estimator and the reweighted expansion estimator of a total. It is based on a consistent first-phase replication variance estimator.
Earlier work: Kott and Stukel (SM, 1997) studied a jackknife variance estimator along the lines of Rao and Shao (Biometrika, 1992).

Small area estimation
Battese, Harter and Fuller, JASA, 1988
Introduced unit level models for small area estimation based on nested error linear regression. Best linear unbiased predictors of small area means and estimators of their MSE are studied, as well as model diagnostics.
Application to estimation of county crop areas using survey and satellite data. The paper reported the actual data, and several subsequent papers used this data set.
Other work includes area level models when the sampling variances are estimated, automatic benchmarking using augmented models, and estimation of three digit counts in Canadian provinces.

Measurement errors
Fuller (1995): Estimation in the presence of measurement errors, Intl. Statist. Rev. (based on Hansen Lecture)
For estimating a population mean, the usual estimators remain unbiased under additive measurement errors with zero means. Variance estimation can also be handled through the interpenetrating samples method of Mahalanobis (1944).
Fuller (1995) demonstrated that these nice features do not hold for the distribution function, quantiles and some other complex parameters.
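A small simulation makes the contrast concrete. The sketch below is not from Fuller (1995); it assumes i.i.d. standard normal true values, independent normal measurement errors with zero mean, and no survey design at all. The sample size, error standard deviation and seed are arbitrary choices.

```python
import random


def percentile(values, p):
    """Empirical p-th quantile (0 < p < 1) by sorting, without interpolation."""
    ordered = sorted(values)
    index = min(len(ordered) - 1, int(p * len(ordered)))
    return ordered[index]


def simulate(n=50000, error_sd=1.0, seed=2011):
    """Compare mean and 0.9 quantile of true and error-contaminated values."""
    rng = random.Random(seed)
    true_vals = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # Additive measurement error with zero mean.
    observed = [t + rng.gauss(0.0, error_sd) for t in true_vals]
    return {
        "mean_true": sum(true_vals) / n,
        "mean_observed": sum(observed) / n,
        "q90_true": percentile(true_vals, 0.9),
        "q90_observed": percentile(observed, 0.9),
    }


result = simulate()
print(result)
```

In this setup the two means should nearly coincide, while the observed 0.9 quantile is inflated by roughly a factor of sqrt(2), because the error doubles the variance of the observed values.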
Usual estimators are biased and inconsistent and can lead to erroneous inferences.
Bias-adjusted estimators can be obtained if, at the design stage, resources are allocated to estimate the measurement error variance through replicate observations for a subsample.

Non-response and imputation
Unit non-response: Fuller et al., 1994, SM
The regression estimator based only on the respondent values is shown to be asymptotically unbiased if the inverse of the response probability $p_i$ is linearly related to $x_i$, the vector of regression variables. This condition is satisfied if the response probability is equal within groups defined by dummy x variables. This result was independently discovered by Särndal and Lundström (2006).
Identify $x_i$ correlated with $p_i$ and (or) $y_i$.

Fractional hot deck imputation
J. Kim and Fuller (2004): Fractional hot deck imputation, Biometrika.
For item non-response, “fractional hot deck imputation replaces each missing observation with a set of imputed values and assigns a weight to each imputed value”. It reduces or eliminates imputation variance, unlike the usual hot deck imputation. Previous work: Kalton & Kish, Comm. Statist. (1984), Fay, JASA (1996).
A consistent replication variance estimator is also proposed. Fractional imputation and the proposed variance estimator “are superior to multiple imputation in general, and much superior to multiple imputation for estimating the variance of a domain mean”.
Kim, Brick, Fuller and Kalton (JRSS B, 2006) showed that the bias of the multiple imputation variance estimator “may be sizeable for certain estimators, such as domain means, when a large fraction of the values are imputed”. The authors propose a bias-adjusted variance estimator.

Rejective sampling
Fuller (2009): Some design properties of a rejective sampling procedure, Biometrika.
A probability sample is rejected unless the estimated mean of the vector of auxiliary variables is within a specified distance of the corresponding known population mean vector.
Asymptotic properties of the regression estimator under rejective sampling remain the same as those of the regression estimator for the original probability sampling procedure.
This method is somewhat similar to balanced sampling. Yves Tillé’s presentation will provide more details of the two methods.

Other contributions
Extensive advisory work for various agencies: twenty-five years on the Statistics Canada advisory committee on methodology, including 20 years as chair. Fuller plans to retire from the committee at the end of 2011. Several methods used in Statistics Canada are based on Fuller’s suggestions at the advisory committee meetings.
Fuller is heavily involved in the work of the Survey Section (now Center for Survey Statistics and Methodology) at Iowa State University. Many of the methods used at the Center are due to Fuller.
Extensive consulting work for USDA and the Census Bureau. Methods used by the National Resources Inventory of USDA are largely due to Fuller.

Ph.D. Theses directed (by Fuller)
(1) Shi, Chang Sheng (1966), Interval estimation for the exponential model and the analysis of rotation experiment.
(2) Wey, Ing-Tzer (1966), Estimation of the mean using the rank statistics of an auxiliary variable.
(3) Yusuf-Mia, Mohammed (1967), Sampling designs employing restricted randomization.
(4) Lund, Richard E. (1967), Factors affecting consumer demand for meat, Webster County, Iowa.
(5) DeGracie, James Sullivan (1968), Analysis of covariance when the concomitant variable is measured with error.
(6) Rosenzweig, Martin Stephen (1968), Ordered estimators for skewed populations.
(7) McElhone, Donald Hughes (1970), Estimation of the mean of skewed distributions using systematic statistics.
(8) Martinez-Garza, Angel (1970), Estimators for the errors in variables model.
(9) O’Brien, Peter Charles (1970), Procedures for selecting the best of several populations.
(10) Isaki, Cary Tsuguo (1970), Survey designs utilizing prior information.
(11) Gallant, A. Ronald (1971), Statistical inference for nonlinear regression models.
(12) Burmeister, Leon Forrest (1972), Estimators for samples selected from multiple overlapping frames.
(13) Huang, Her Tzai (1972), Combining multiple responses in sample surveys.
(14) Jobson, John David (1972), Estimation for linear models with unknown diagonal covariance matrix.
(15) Tejeda-Sanhueza, Herman R. (1973), Statistical analysis and model building for a wheat production system in Chile.
(16) Booth, Gordon D. (1973), The errors-in-variables model when the covariance matrix is not constant.
(17) Battese, George E. (1973), Parametric models for response errors in survey sampling.
(18) Goebel, John Jeffery (1974), Nonlinear regression in the presence of autocorrelated errors.
(19) Wolter, Kirk M. (1974), Estimates for a nonlinear functional relationship.
(20) Hidiroglou, Michael A. (1974), Estimation of regression parameters for finite populations.
(21) Dickey, David A. (1976), Estimation and hypothesis testing in nonstationary time series.
(22) Carter, Randy Lee (1976), Instrumental variable estimation of the simple errors in variables model.
(23) Wang, George H. K. (1976), Estimators for the simultaneous equation model with lagged endogenous variables and autocorrelated errors: with application to the U.S. farm labor market.
(24) Hasza, David P. (1977), Estimation in nonstationary time series.
(25) Huang, Elizabeth T. H. (1978), Nonnegative regression estimation for sample survey data.
(26) Bhattacharyay, Biswanath (1979), Estimation for varying parameter stochastic difference equations.
(27) Dahm, Paul F. (1979), Estimation of the parameters of the multivariate linear errors in variables model.
(28) Macpherson, Brian D. (1981), Properties of estimation for the parameter of the first order moving average process.
(29) Drew, James H. (1981), Nonresponse in surveys with callbacks.
(30) Mowers, Ronald P. (1981), Effects of rotations and nitrogen fertilization on corn yields at the Northwest Iowa (Galva-Primghar) Research Center.
(31) Lee, Edward H. (1981), Estimation of seasonal autoregressive time series.
(32) Pantula, Sastry G. (1982), Properties of estimators of the parameters of autoregressive time series.
(33) Amemiya, Yasuo (1982), Estimators for the errors-in-variables model.
(34) Tin Chiu Chua (1983), Response errors in repeated surveys with duplicated observations.
(35) Harter, Rachel (1983), Small area estimation using nested-error models and auxiliary data.
(36) Hung, Hsien-Ming (1983), Use of transformed LANDSAT data in regression estimation of crop acreages.
(37) Miazaki, Edina Shisue (1984), Estimation for time series subject to the error of rotation sampling.
(38) Miller, Stephen M. (1986), The limiting behavior of residuals from measurement error regressions.
(39) Nagaraj, Neerchal K. (1986), Estimation of stochastic difference equations with nonlinear restrictions.
(40) Schnell, Daniel J. (1987), Estimators for the nonlinear errors-in-variables model.
(41) Francisco, Carol A. (1987), Estimation of quantiles and the interquartile range in complex surveys.
(42) Morel, Jorge Guillermo (1987), Multivariate nonlinear models for vectors of proportions: A generalized least squares approach. (Joint major professor with Ken Koehler.)
(43) Eltinge, John Lamont (1987), Measurement error models for time series.
(44) Hasabelnaby, Nancy Ann Eyink (1987), The use of a weighting function in measurement error regression.
(45) Sullivan, Gary R. (1989), The use of added error to avoid disclosure in microdata releases.
(46) Shin, Dongwan (1990), Estimation for the autoregressive moving average model with a unit root.
(47) Sarkar, Sahadeb (1990), Nonlinear least squares estimators with differential rates of convergence.
(48) Park, Heon Jin (1990), Alternative estimators of the parameters of the autoregressive process.
(49) Croos, Joseph H. R. (1992), Robust estimation in measurement error models.
(50) Tollefson, Margot H. (1992), Variance estimation under random imputation.
(51) Adam, Abdoulaye (1992), Covariance estimation for characteristics of the Current Population Survey.
(52) Yansaneh, Ibrahim S. (1992), Least squares estimation for repeated surveys.
(53) Sanger, Todd M. (1992), Estimated generalized least squares estimation for the heterogeneous measurement error model.
(54) Sriplung, Kai-one (1993), Mispricing in the Black-Scholes model: An exploratory analysis. (Joint direction with Stanley Johnson, Economics.)
(55) Deo, Rohit (1995), Tests for unit roots in multivariate autoregressive processes.
(56) An, Anthony B. (1996), Regression estimation for finite population means in the presence of nonresponse.
(57) Sarkar, Pradipta (1997), Estimation and prediction for non-Gaussian autoregressive processes.
(58) Chen, Cong (1999), Spline estimators of the distribution function of a variable measured with error. (Joint with F. Jay Breidt.)
(59) Roy, Anindya (1999), Estimation for autoregressive processes.
(60) Dodd, Kevin W. (1999), Estimation of a distribution function from survey data. (Joint with Alicia Carriquiry.)
(61) Goyeneche, Juan José (1999), Estimation of the distribution function using auxiliary information.
(62) Kim, Jae-kwang (2000), Variance estimation after imputation.
(63) Wang, Junyuan (2001), Small area estimation in the National Resources Inventory. (Joint with F. Jay Breidt.)
(64) Qu, Yongming (2002), Estimation for the nonlinear errors-in-variables model.
(65) Park, Mingue (2002), Regression estimation of the mean in survey sampling.
(66) Legg, Jason (2006), Estimation for two-phase longitudinal surveys with application to the National Resources Inventory. (Joint with Sarah Nusser.)
(67) Wu, Yu (2006), Estimation of regression coefficients with unequal probability samples.
(68) Beyler, Nicholas (2010), Statistical methods for analyzing physical activity data. (Joint with Sarah Nusser.)
(69) Berg, Emily (2010), A small area procedure for estimating population counts. (Joint with Sarah Nusser.)