HISTORICAL PERSPECTIVE OF SURVEY SAMPLING

A.K. Srivastava
Former Joint Director, I.A.S.R.I., New Delhi - 110012

1. Introduction

The purpose of this article is to provide an overview of developments in sampling theory and its applications, particularly with respect to survey data analysis. An attempt is made to present a historical perspective and the background behind the major developments in the field of sample surveys. It is interesting to observe that many of the questions raised and the concepts visualized on intuitive arguments in the early stages of development have reappeared with fresh theoretical justification. Some of these questions relate to foundational aspects of sample surveys. The issues raised in this process have a strong bearing on analytical aspects of survey data analysis. The article is based on various review papers available in the literature. However, because the area is vast, attention is confined to the important developments, and no attempt is made at an exhaustive review.

The developments in survey sampling may be classified into the following four periods:

i. Early developments (before Neyman (1934))
ii. From 1934 to the early 1950s (to be specific, 1955)
iii. From 1955 to the early 1980s
iv. More recent developments (after 1980)

2. Early Developments

The first recorded strong plea for the use of samples in data collection was made by Kiaer (1895) at the meeting of the International Statistical Institute (ISI) in Berne. Kiaer presented a report on his experience with sample surveys conducted in the Norwegian Bureau of Statistics and advocated further investigations in the field. Kiaer’s plea was received with skepticism. It was felt that the method of sample survey could not replace the method of enumeration. At that stage it was felt that three survey methods were possible:

i. Complete enumeration
ii. Monography, and
iii. Statistical exploration (the term then used to describe Kiaer’s method)

In the subsequent meetings at St. Petersburg (1899) and at Budapest (1901), Kiaer defended his method, emphasizing the representativeness of the sample. Kiaer’s efforts bore fruit at the Berlin Session of the ISI in 1903. In this Session, the ISI adopted a resolution which recommended the use of the representative method, subject to the provision that in the publication of the results the conditions under which the selection of the observations was made were completely specified. The sample survey had become an acceptable method of data collection.

There are four important principles involved in Kiaer’s approach:

a) Representativeness
b) Lack of subjectivity
c) Reliability of the results should be assessed
d) Complete specification of the method of selection should be included with the results of any sample survey

It may well be realized how important these considerations have been in shaping the future of sampling theory and its application. Kiaer’s method of selection was, in effect, proportional stratified multistage sampling (of course, without random selection).

Bowley (1906) was the first to supply a theory of inference for survey samples. Using Edgeworth’s Bayesian version of the Central Limit Theorem, he was able to assess the accuracy of estimates made from large samples drawn by simple random sampling from large finite populations. His theoretical analysis showed that very often quite small samples are good enough and that a census is not always necessary. Bowley’s method was, however, limited to simple random sampling.

At this stage, random sampling was considered a feasible proposition and was treated on a par with the purposive method of selection. The two methods represent logical developments of the methodology presented by Kiaer. The method of random selection carried to its logical conclusion the principle that the selection of units should be made in an objective manner.
The method of purposive selection was a logical extension of the principle that the sample should be a miniature of the whole population. What was lacking was the means to bring these two principles together.

3. Neyman (1934) and Subsequent Developments

Neyman’s (1934) paper ‘On the two different aspects of the representative method: the method of stratified sampling and the method of purposive selection’ was a watershed in the development of sampling theory and its practice. His paper was a powerful and convincing thesis which established random sampling not only as a viable alternative to, but also as a much superior tool than, the purposive method of selection. He developed a theory of inference based on confidence intervals which is suitable for use with populations of the kind encountered in survey work. He suggested that it was possible, using the idea of confidence intervals, to define what he termed a representative method of sampling and a consistent method of estimation.

The use of confidence intervals derived from the probabilities due to the sampling design provided a new framework of inference for finite population sampling. No longer was there any need to make prior assumptions about the population, as was necessary for the application of Bayes’ argument. The validity of confidence statements did not depend on any assumptions about the population. This new framework of inference was basically nonparametric and, as such, particularly attractive. This feature was, in fact, so appealing that for the next thirty years or so, inference for samples drawn from finite populations was firmly tied to the distribution generated by the randomization of the sampling design. This distribution is sometimes called the p-distribution.

Neyman, considering the choice of estimators which could be most suitable for the construction of confidence intervals, suggested that two requirements could be formulated as follows:

i. They must follow a frequency distribution which is already tabulated or may be easily calculated.
ii. The resulting confidence intervals should be as narrow as possible.

The first of these requirements was suggested because of practical considerations. The second introduced the concept of efficiency. Neyman’s final conclusions could be summarized as follows:

i. It is possible to select random (probability) samples of groups of units (clusters) which are measurable (i.e. the variance can be estimated from the sample itself).
ii. The number of sampling units in the sample should be large.
iii. Subjective or objective information can be used in sample design without departing from probability sampling; to make the validity of the estimates dependent on prior guesses about the population is dangerous and generally unsuccessful.
iv. The only method which can be advised for general use is stratified random sampling.
v. There are instances where purposive selection may be used with success, but such instances are exceptional and should not be used to determine the general strategy of sample design.

The next 15 years (1935 to 1950) witnessed very rapid all-round growth in survey sampling based on the random sampling approach. Various probability sampling methods were developed and refined during this period. A theory for systematic sampling, which was already in use in the early stages, was developed (Madow and Madow, 1944). The method of sampling with probability proportional to size was introduced by Hansen and Hurwitz (1943). Ratio and regression methods of estimation were also introduced during the 1930s, with a comprehensive account of the theory being provided by Cochran (1942). The concept of double sampling was introduced by Neyman (1938). Sampling over successive occasions was introduced by Jessen (1942) and further developed by Patterson (1950). An interesting feature of the developments during this period was that the methods developed were being simultaneously tested through actual surveys.
In fact, the need for the various methods arose from practical considerations. In India, at the Indian Statistical Institute, Calcutta, various methods such as interpenetrating sampling and cost functions were developed. The jute survey conducted by Mahalanobis (1938) is an elegant example of conducting pilot sample surveys. The methodology of crop estimation surveys through crop cutting techniques was developed at the Indian Council of Agricultural Research. By 1950 quite substantial developments had taken place, which were consolidated in the form of various textbooks (Yates (1949), Deming (1950), Cochran (1953), Hansen, Hurwitz and Madow (1953) and Sukhatme (1954)).

4. Between 1955 and 1980

The concept of varying probability sampling, which was considered for the with-replacement case by Hansen and Hurwitz in 1943, was further developed for the without-replacement case by Narain (1951) and Horvitz and Thompson (1952). These papers led to a series of studies on various methods of varying probability sampling without replacement. But a more important impact was on the future direction of research in sampling theory. Horvitz and Thompson considered three classes of linear estimators, termed T1, T2 and T3. These classes have since been extended up to T8. Godambe (1955), while investigating a unified theory of sampling, proved the non-existence of uniformly minimum variance (UMV) estimators. A more elegant proof of this theorem was given by Basu (1971). In the choice of good estimators in a particular class, various concepts like admissibility, hyperadmissibility etc. were considered. The non-existence of a best estimator led to the choice of best estimators in restricted classes. In the absence of UMV estimators, concepts like necessary best estimators were considered.

Another direction in which research in sampling theory progressed was the attempt to bring the estimation problem in sampling theory closer to the estimation problem in usual statistical inference.
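As an illustration of the varying probability estimation discussed in this section, the estimator commonly associated with Horvitz and Thompson (1952) weights each sampled value by the reciprocal of its inclusion probability. The following Python sketch is only a toy example: the population values, the four-sample design and the function names are assumptions made for illustration and are not taken from the papers cited. It enumerates every sample of the design to check that the estimator of the population total is unbiased with respect to the design.

```python
# Hypothetical population of four units (values chosen only for illustration).
population = [3.0, 5.0, 8.0, 12.0]

# Hypothetical without-replacement design: each possible sample (a tuple of
# unit indices) is listed with its selection probability; these sum to 1.
design = {(0, 1): 0.1, (0, 2): 0.2, (1, 3): 0.3, (2, 3): 0.4}


def inclusion_probabilities(design, size):
    """Inclusion probability of each unit: total probability of the samples containing it."""
    pi = [0.0] * size
    for sample, prob in design.items():
        for unit in sample:
            pi[unit] += prob
    return pi


def ht_total(sample, y, pi):
    """Estimate of the population total: sum of y_i / pi_i over the sampled units."""
    return sum(y[i] / pi[i] for i in sample)


pi = inclusion_probabilities(design, len(population))

# Estimate obtained from each possible sample.
estimates = {s: ht_total(s, population, pi) for s in design}

# Expectation of the estimator over the sampling design.
expected_estimate = sum(prob * estimates[s] for s, prob in design.items())

print("inclusion probabilities:", pi)
print("estimate from each sample:", estimates)
print("design expectation:", expected_estimate)
print("true population total:", sum(population))
```

Although the individual sample estimates differ from one another, their average over the design reproduces the population total, which is the sense in which the estimator is unbiased under the randomization distribution.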
Concepts like sufficiency and the likelihood function were considered. The sufficiency concept led to the improvement of estimators through Rao-Blackwellisation of ordered estimators and of estimators based on distinct units. One of the important findings emerging from these considerations was that the sampling design should not have a role to play in the estimation of population parameters. In fact, the likelihood function in sampling problems is flat and, therefore, non-informative.

Simultaneously, model-based estimation (Brewer (1963), Royall (1968, 1970, 1971, 1972)) was being developed. This approach was known as the predictive model of estimation. In this method also, the estimation does not depend upon the sampling design. In fact, it depends upon the model. A detailed discussion of these developments is available in the book by Cassel, Sarndal and Wretman (1977).

5. Analysis of Survey Data (1980 Onwards)

During the last three decades, besides research on foundational and inferential aspects, there has been a change in emphasis towards the analysis of survey data, such as contingency tables of estimated counts, logistic regression and multivariate analysis. Further, methods that take proper account of the complexity of survey data have been proposed. All these became possible mainly due to enhanced computer capabilities.

Most of the literature on sampling theory deals with the estimation of population parameters such as means, totals and ratios, along with their standard errors. In recent years, considerable attention has also been given to complex descriptive parameters such as domain (subpopulation) totals and means, quantiles, and regression and correlation coefficients. Estimation of domain parameters, i.e., the estimation of small area statistics, has gained importance in view of the growing needs of microlevel planning.
The main problem in the estimation of domain parameters is that sample sizes in subpopulations are too small to provide reliable estimates with the help of direct estimators. Hence, it becomes essential to borrow information from related or similar areas through explicit and implicit models to produce indirect estimators, which increase the effective sample size. The advances in computing facilities have also provided convenient tools for many theoretical developments in this area.

The assumption underlying classical methods, i.e., that the units of the sample are independently and identically distributed (i.i.d.), no longer holds in the case of complex survey data, as these data are collected through complex sampling designs. For these purposes even sampling designs like stratified and cluster sampling may be treated as complex ones. Regression coefficients are estimated with the help of the multiple regression technique, which assumes the independence of observations. Consequently, applying the standard ordinary least squares (OLS) technique to survey data for estimating regression coefficients gives misleading results because the sample units are dependent. Efforts have been made to incorporate the effect of the sampling design with the help of various approaches to inference.

With inexpensive and easy access to computers, researchers are interested not only in descriptive surveys but also in analytical surveys, i.e., in investigating relationships among the variables in a survey. The analysis of categorical variables from survey data comes under analytical surveys. Here also, due to violation of the i.i.d. assumption, standard statistics such as chi-square tests need adjustment to ensure valid conclusions.

Lastly, the area of variance estimation has also gained importance in view of complex survey designs as well as the vast potential of modern computation.
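The consequence of ignoring the design when estimating variance can be illustrated numerically. The Python sketch below simulates a hypothetical single-stage cluster sample with equal-sized clusters and compares the standard error of the sample mean computed as if the observations were i.i.d. with one computed by deleting one cluster at a time (a delete-one-cluster jackknife, chosen here only as a convenient example of a variance estimator that respects the design). All the numbers (30 clusters of 20 units, the between- and within-cluster standard deviations, the seed) and the function names are assumptions made for illustration.

```python
import math
import random
import statistics


def naive_se(values):
    """Standard error of the mean computed as if the observations were i.i.d."""
    return statistics.stdev(values) / math.sqrt(len(values))


def jackknife_cluster_se(clusters):
    """Delete-one-cluster jackknife standard error of the overall sample mean."""
    k = len(clusters)
    total = sum(sum(c) for c in clusters)
    count = sum(len(c) for c in clusters)
    # Recompute the overall mean with each cluster left out in turn.
    leave_one_out = [(total - sum(c)) / (count - len(c)) for c in clusters]
    centre = statistics.mean(leave_one_out)
    variance = (k - 1) / k * sum((t - centre) ** 2 for t in leave_one_out)
    return math.sqrt(variance)


def simulate_clusters(k=30, m=20, between_sd=2.0, within_sd=1.0, seed=1):
    """Hypothetical data: k clusters of m units sharing a common cluster effect."""
    rng = random.Random(seed)
    clusters = []
    for _ in range(k):
        effect = rng.gauss(0.0, between_sd)  # shared by every unit in the cluster
        clusters.append([10.0 + effect + rng.gauss(0.0, within_sd) for _ in range(m)])
    return clusters


clusters = simulate_clusters()
pooled = [y for c in clusters for y in c]

se_iid = naive_se(pooled)
se_jk = jackknife_cluster_se(clusters)

print("standard error assuming i.i.d. observations:", se_iid)
print("delete-one-cluster jackknife standard error:", se_jk)
```

With strongly clustered data such as these, the i.i.d. formula understates the standard error considerably. For equal-sized clusters the jackknife value coincides with the standard deviation of the cluster means divided by the square root of the number of clusters.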
The traditional approach needs a theoretical variance formula to be derived for each estimator, which is quite cumbersome in the case of complex survey designs. In the case of non-linear estimators it may sometimes not even be possible to obtain such a formula. However, if we take recourse to what are called resampling techniques, we need only obtain and use a single, simple expression for the variance estimator of all the estimators under the given sampling design. Of course, these techniques are highly computer intensive.

To conclude, it may be remarked that the analysis of survey data in the large-scale surveys conducted in the country has been mainly on traditional lines. The data have yet to be utilised through the recent approaches to complex data analysis. In fact, such attempts will go a long way towards proper utilization of the vast amount of data being collected in the country on a regular basis.

References

1. Basu, D. (1971). An essay on the logical foundations of survey sampling, Part I. In Foundations of Statistical Inference, Holt, Rinehart and Winston, Toronto, 203-242.
2. Bowley, A.L. (1906). Address to the Economic Science and Statistics Section of the British Association for the Advancement of Science. J. Roy. Statist. Soc., 69, 548-557.
3. Brewer, K.R.W. (1963). A model of systematic sampling with unequal probabilities. Austral. J. Statist., 5, 3-5.
4. Cassel, C.M., Sarndal, C.E. and Wretman, J.H. (1977). Foundations of Inference in Survey Sampling. New York: Wiley.
5. Cochran, W.G. (1942). Sampling theory when the sampling units are of unequal sizes. J. Amer. Statist. Assoc., 37, 199-212.
6. Godambe, V.P. (1955). A unified theory of sampling from finite populations. J. Roy. Statist. Soc. B, 17, 269-278.
7. Hansen, M.H., Hurwitz, W.N. and Madow, W.G. (1953). Sample Survey Methods and Theory. John Wiley and Sons, New York, Vols. I and II.
8. Horvitz, D.G. and Thompson, D.J. (1952). A generalisation of sampling without replacement from a finite universe. J. Amer. Statist. Assoc., 47, 663-685.
9. Jessen, R.J. (1942). Statistical investigation of a sample survey for obtaining farm facts. Iowa Agricultural Experimental Station Research Bulletin No. 304.
10. Kiaer, A.N. (1895-6). Observations et expériences concernant des dénombrements représentatifs. Bull. Int. Stat. Inst., 9, Liv. 2, 176-183.
11. Narain, R.D. (1951). On sampling without replacement with varying probabilities. J. Ind. Soc. Agril. Statist., 3, 169-174.
12. Neyman, J. (1934). On the two different aspects of the representative method: The method of stratified sampling and the method of purposive selection. J. Roy. Statist. Soc., 97, 555-606.
13. Neyman, J. (1938). Contribution to the theory of sampling human populations. J. Amer. Statist. Assoc., 33, 101-116.
14. Royall, R.M. (1968). An old approach to finite population sampling theory. J. Amer. Statist. Assoc., 63, 1269-1279.
15. Royall, R.M. (1970). On finite population sampling theory under certain linear regression models. Biometrika, 57, 377-387.
16. Sukhatme, P.V. (1959). Major developments in the theory and application of sampling during the last twenty five years. Estadistica, 17, 62-679.