Document 6531329

HISTORICAL PERSPECTIVE OF SURVEY SAMPLING
A.K. Srivastava
Former Joint Director, I.A.S.R.I., New Delhi -110012
1. Introduction
The purpose of this article is to provide an overview of developments in sampling theory and
its application particularly with respect to survey data analysis. An attempt is made to present
a historical perspective and the background behind the major developments in the field of
sample surveys. It is interesting to observe that many of the questions raised and the concepts
visualized on intuitive arguments in the early stages of development have reappeared with
fresh theoretical justification. Some of these questions relate to foundational aspects of
sample surveys. The issues raised in this process have a strong bearing on the analytical
aspects of survey data analysis. The article is based on various review papers available in
the literature. However, owing to the vastness of the area, attention is confined to the
important developments, and no attempt is made at an exhaustive review.
The developments in survey sampling may be classified into the following four periods:
i. Early developments (before Neyman (1934))
ii. From 1934 to the early 1950s (to be specific, 1955)
iii. From 1955 to the early 1980s
iv. More recent developments (after 1980)
2. Early Developments
The first strong plea for the use of samples in data collection was made by Kiaer
(1895) at the I.S.I. meeting in Berne. Kiaer presented a report on his experience with sample
surveys conducted in the Norwegian Bureau of Statistics and advocated further investigations
in the field. Kiaer’s plea was received with skepticism. It was felt that the method of sample
survey could not replace the method of enumeration. At that stage it was felt that there were
three survey methods which were possible:
i. Complete enumeration
ii. Monography, and
iii. Statistical exploration (the term then used to describe Kiaer’s method)
In the subsequent meetings at St. Petersburg (1899) and at Budapest (1901), Kiaer defended
his method, emphasizing the representativeness of the sample. Kiaer’s efforts bore fruit at the
Berlin Session of the ISI in 1903. In this Session, the ISI adopted a resolution which
recommended the use of the representative method, subject to the provision that in the
publication of the results the conditions under which the selection of the observations was
made were completely specified. The sample survey had become an acceptable method of
data collection. There are four important principles involved in Kiaer’s approach:
a) Representativeness
b) Lack of subjectivity
c) Assessment of the reliability of the results
d) Complete specification of the method of selection, to be included with the results of any
sample survey
It is easy to see how important these considerations have been in shaping the
future of sampling theory and its application. Kiaer’s method of selection was, in effect,
a proportional stratified multistage sampling (of course, without random selection). Bowley
(1906) was the first to supply a theory of inference for survey samples. Using Edgeworth’s
Bayesian version of the Central Limit Theorem, he was able to assess the accuracy of
estimates made from large samples drawn by simple random sampling from large finite
populations. His theoretical analysis showed that quite small samples are very often good
enough and that a census is not always necessary. Bowley’s method was, however, limited to
simple random sampling. At this stage, random sampling was considered a feasible
proposition and was treated on a par with the purposive method of selection.
The two methods represent logical developments of the methodology presented by Kiaer. The
method of random selection carried to its logical conclusion the principle that the selection
of units should be made in an objective manner. The method of purposive selection was a logical extension
of the principle that the sample should be a miniature of the whole population. What was
lacking was the means to bring these two principles together.
3. Neyman (1934) and Subsequent Developments
Neyman’s (1934) paper ‘On the two different aspects of the representative method: the
method of stratified sampling and the method of purposive selection’ was a watershed in the
development of sampling theory and its practice. His paper was a powerful and convincing
thesis which established random sampling not only as a viable alternative but also as a far
superior tool to the purposive method of selection. He developed a theory of inference based
on confidence intervals which is suitable for use with populations of the kind encountered in
survey work. He suggested that it was possible, using the idea of confidence intervals, to
define what he termed a representative method of sampling and a consistent method of
estimation.
The use of confidence intervals derived from the probabilities due to the sampling design
provided a new framework of inference for finite population sampling. No longer was there
any need to make prior assumptions about the population, as was necessary for the application
of Bayes’ argument. The validity of confidence statements did not depend on any
assumptions about the population. This new framework of inference was basically
nonparametric and, as such, particularly attractive. This feature was, in fact, so appealing that for
the next thirty years or so, inference for samples drawn from finite populations was firmly
tied to the distribution generated by the randomization of the sampling design. This
distribution is sometimes called the p-distribution.
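As an illustration of the p-distribution, the following minimal Python sketch enumerates every possible simple random sample (without replacement) from a small population and checks that the expectation and variance of the sample mean, taken over the sampling design alone, agree with the standard formulae. The population values and the sample size are hypothetical, chosen only for illustration; they do not come from any of the surveys discussed here.

```python
from itertools import combinations
from statistics import mean, variance

# Hypothetical finite population of N = 6 fixed values (no model is assumed).
population = [3.0, 7.0, 8.0, 12.0, 15.0, 21.0]
N, n = len(population), 3

# Under simple random sampling without replacement every subset of size n
# is equally likely, so the p-distribution of the sample mean is obtained
# by listing all C(N, n) possible samples.
sample_means = [mean(s) for s in combinations(population, n)]

# Expectation and variance with respect to the sampling design only.
design_expectation = mean(sample_means)
design_variance = sum(
    (m - design_expectation) ** 2 for m in sample_means
) / len(sample_means)

# Standard formula for the same variance: (1 - n/N) * S^2 / n,
# where S^2 is the population variance with divisor N - 1.
formula_variance = (1 - n / N) * variance(population) / n

print("design expectation:", design_expectation, "population mean:", mean(population))
print("design variance:", design_variance, "formula:", formula_variance)
```

Nothing is assumed about how the population values arose; the only source of randomness is the choice of sample, which is the sense in which this framework of inference is tied to the sampling design.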
Neyman, considering the choice of estimators which could be most suitable for the
construction of confidence intervals, suggested that two requirements could be formulated as
follows:
i. They must follow a frequency distribution which is already tabulated or may be easily
calculated
ii. The resulting confidence intervals should be as narrow as possible
The first of these requirements was suggested because of practical considerations. The second
introduced the concept of efficiency. Neyman’s final conclusions could be summarized as
follows:
i. It is possible to select random (probability) samples of groups of units (clusters)
which are measurable (i.e. the variance can be estimated from the sample itself).
ii. The number of sampling units in the sample should be large.
iii. Subjective or objective information can be used in sample design without departing
from probability sampling; to make the validity of the estimates dependent on prior
guesses about the population is dangerous and generally unsuccessful.
iv. The only method which can be advised for general use is stratified random sampling.
v. There are instances where purposive selection may be used with success, but such
instances are exceptional and should not be used to determine the general strategy of
sample design.
The next 15 years (1935 to 1950) witnessed very rapid all-round growth in survey sampling
based on the random sampling approach. Various probability sampling methods were developed
and refined during this period. A theory for systematic sampling, which was already in use in
the early stages, was developed (Madow and Madow, 1944). The probability proportional to size
sampling method was introduced (Hansen and Hurwitz, 1943). Ratio and regression methods
of estimation were also introduced during the 1930s, with a comprehensive account of the
theory being provided by Cochran (1942). The concept of double sampling was introduced by
Neyman (1938). Sampling over successive occasions was introduced by Jessen (1942) and
further developed by Patterson (1950). An interesting feature of the developments during this
period was that the methods developed were simultaneously being tested through actual surveys.
In fact, the need for the various methods arose from practical considerations alone. In India,
at the Indian Statistical Institute, Calcutta, various methods such as interpenetrating sampling and
cost functions were developed. The jute survey conducted by Mahalanobis (1938) is an
elegant example of the conduct of pilot sample surveys. The methodology of crop estimation
surveys through crop cutting techniques was developed at the Indian Council of Agricultural
Research. By 1950 quite substantial developments had taken place, which were consolidated
in the form of various textbooks (Yates (1949), Deming (1950), Cochran (1953), Hansen,
Hurwitz and Madow (1953) and Sukhatme (1954)).
4. Between 1955 and 1980
The concept of varying probability sampling, which was considered for the with-replacement case
by Hansen and Hurwitz in 1943, was further developed for the without-replacement case by
Narain (1951) and Horvitz and Thompson (1952). These papers led to a series of studies on
various methods of varying probability sampling without replacement. But a more important
impact was on the future direction of research in sampling theory. Horvitz and Thompson
considered three classes of linear estimators, termed T1, T2 and T3. These classes have now
been extended up to T8. Godambe (1955), while investigating a unified theory of
sampling, proved the non-existence of uniformly minimum variance (UMV) estimators. A more elegant
proof of this theorem was given by Basu (1971). In the choice of good estimators in a
particular class, various concepts such as admissibility and hyperadmissibility were considered.
The non-existence of a best estimator led to the choice of a best estimator in restricted classes. In the
absence of UMV estimators, concepts like necessary best estimators were considered.
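The estimator now associated with the names of Horvitz and Thompson weights each sampled value by the reciprocal of its inclusion probability. The following Python sketch, offered only as an illustration, builds a small without-replacement design with unequal sample probabilities and verifies numerically that this estimator of the population total is unbiased over the design. The population values, the sample probabilities and the function name horvitz_thompson_total are all hypothetical and are not taken from the papers cited above.

```python
from itertools import combinations

# Hypothetical population of 4 units with study values y.
y = [10.0, 20.0, 30.0, 40.0]
units = range(len(y))

# A hypothetical without-replacement design: each sample of size 2 is
# assigned an unequal selection probability p(s); these sum to one.
samples = list(combinations(units, 2))
p = dict(zip(samples, [0.05, 0.10, 0.15, 0.15, 0.20, 0.35]))

# Inclusion probability of unit i: the total probability of all
# samples that contain unit i.
pi = [sum(prob for s, prob in p.items() if i in s) for i in units]

def horvitz_thompson_total(sample):
    """Estimate the population total as the sum of y_i / pi_i over the sample."""
    return sum(y[i] / pi[i] for i in sample)

# Expectation of the estimator over the design, compared with the true total.
expected_estimate = sum(prob * horvitz_thompson_total(s) for s, prob in p.items())

print("inclusion probabilities:", pi)
print("design expectation:", expected_estimate, "true total:", sum(y))
```

The unbiasedness holds for any design in which every unit has a positive inclusion probability; the variance of the estimator, which is not computed here, does depend on the particular design chosen.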
Another direction in which research in sampling theory progressed was the attempt to bring
the estimation problem in sampling theory closer to the estimation problem in usual statistical
inference. Concepts like sufficiency and the likelihood function were considered. The concept of
sufficiency led to the improvement of estimators through Rao-Blackwellisation of ordered
estimators and of estimators based on distinct units. One of the important findings that
emerged from these considerations was that the sampling design should not have a role to play
in the estimation of population parameters. In fact, the likelihood function in sampling
problems is flat and is therefore non-informative. Simultaneously, model-based estimation
methods (Brewer (1963), Royall (1968, 1970, 1971, 1972)) were being developed. This approach was known
as the predictive model of estimation. In this method also, the estimation does not depend upon
the sampling design; it depends upon the model. A detailed discussion of these
developments is available in the book by Cassel, Sarndal and Wretman (1977).
5. Analysis of Survey Data (1980 Onwards)
During the last three decades, besides research on foundational and inferential
aspects, the emphasis has shifted towards the analysis of survey data, such as
contingency tables of estimated counts, logistic regression and multivariate analysis. Further,
methods that take proper account of the complexity of survey data have been proposed. All
this became possible mainly because of enhanced computing capabilities. Most of the literature
on sampling theory deals with the estimation of population parameters such as means, totals and
ratios, along with their standard errors. In recent years, considerable attention has also been
given to complex descriptive parameters such as domain (subpopulation) totals and means,
quantiles, and regression and correlation coefficients.
The estimation of domain parameters, i.e., the estimation of small area statistics, has gained
importance in view of the growing needs of micro-level planning. The main problem in the
estimation of domain parameters is that sample sizes in subpopulations are too small to
provide reliable estimates with the help of direct estimators. Hence, it becomes essential to
borrow information from related or similar areas through explicit and implicit models to
produce indirect estimators, which increase the effective sample size. The advances in
computing facilities have also provided convenient tools for many theoretical developments
in this area.
The assumption underlying classical methods, namely that the units of the sample are independently
and identically distributed (i.i.d.), no longer holds in the case of complex survey data, as these data
are collected through complex sampling designs. For this purpose, even sampling designs
such as stratified and cluster sampling may be treated as complex ones.
Regression coefficients are estimated with the help of the multiple regression technique, which
assumes independence of the observations. Consequently, applying the standard ordinary least
squares (OLS) technique to survey data for estimating regression coefficients gives
misleading results because the sample units are dependent. Efforts have been made to incorporate the
effect of the sampling design with the help of various approaches to inference.
With inexpensive and easy access to computers, researchers are interested not only in
descriptive surveys but also in analytical surveys, i.e., in investigating relationships among
variables in the survey. The analysis of categorical variables from survey data comes under
analytical surveys. Here also, owing to violation of the i.i.d. assumption, standard statistics such as
chi-square tests need adjustment to ensure valid conclusions.
Lastly, the area of variance estimation has also gained importance in view of complex survey
designs as well as the vast potential of modern computation. The traditional approach requires
deriving a theoretical variance formula for each estimator, which is quite cumbersome in the case
of complex survey designs. In the case of non-linear estimators it may sometimes not even be possible
to obtain such a formula. However, if we take recourse to what are called
resampling techniques, we need only obtain and use a single, simple expression for the
variance estimator of all the estimators under the given sampling design. Of course, these
techniques are highly computer-intensive.
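One widely used resampling technique is the delete-one jackknife. The Python sketch below is a simplified illustration rather than a recipe for any particular survey: it assumes a single-stage simple random sample with a negligible sampling fraction, whereas in stratified multistage designs whole primary sampling units are usually deleted within strata. The data values and the function names ratio and jackknife_variance are hypothetical. The point it illustrates is that the same variance expression serves any estimator, including a non-linear one such as a ratio.

```python
# Hypothetical sample of paired observations (y_i, x_i) on n = 5 units.
ys = [12.0, 15.0, 9.0, 20.0, 14.0]
xs = [10.0, 13.0, 8.0, 18.0, 11.0]

def ratio(y_values, x_values):
    """Non-linear estimator: ratio of the sample totals of y and x."""
    return sum(y_values) / sum(x_values)

def jackknife_variance(y_values, x_values, estimator):
    """Delete-one jackknife variance estimate for an arbitrary estimator."""
    n = len(y_values)
    # Recompute the estimator n times, leaving out one unit each time.
    replicates = [
        estimator(y_values[:i] + y_values[i + 1:], x_values[:i] + x_values[i + 1:])
        for i in range(n)
    ]
    centre = sum(replicates) / n
    # The same expression is used whatever the estimator is.
    return (n - 1) / n * sum((r - centre) ** 2 for r in replicates)

print("ratio estimate:", ratio(ys, xs))
print("jackknife variance:", jackknife_variance(ys, xs, ratio))
```

Passing a different function in place of ratio, for example a sample mean, requires no new derivation; for the sample mean the jackknife reproduces the familiar estimate s^2/n.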
To conclude, it may be remarked that the analysis of survey data in the large-scale surveys
conducted in the country has been mainly on traditional lines. The recent approaches to
complex survey data analysis are yet to be used in exploiting these data. In fact,
such attempts will go a long way towards proper utilization of the vast amount of data being
collected in the country on a regular basis.
References
1. Basu, D. (1971). An essay on the logical foundations of survey sampling, Part I. In
Foundations of Statistical Inference, Holt, Rinehart and Winston, Toronto, 203-242.
2. Bowley, A.L. (1906). Address to the Economic Science and Statistics Section of the
British Association for the Advancement of Science. J. Roy. Statist. Soc., 69, 548-557.
3. Brewer, K.R.W. (1963). A model of systematic sampling with unequal probabilities.
Austral. J. Statist., 5, 3-5.
4. Cassel, C.M., Sarndal, C.E., and Wretman, J.H. (1977). Foundations of Inference in
Survey Sampling. New York: Wiley.
5. Cochran, W.G. (1942). Sampling theory when the sampling units are of unequal
sizes. J. Amer. Statist. Assoc., 37, 199-212.
6. Godambe, V.P. (1955). A unified theory of sampling from finite populations. J. Roy.
Statist. Soc. B, 17, 269-278.
7. Hansen, M.H., Hurwitz, W.N. and Madow, W.G. (1953). Sample survey methods and
theory. John Wiley and Sons, New York, Vols. I and II.
8. Horvitz, D.G. and Thompson, D.J. (1952). A generalisation of sampling without
replacement from a finite universe. J. Amer. Statist. Assoc., 47, 663-685.
9. Jessen, R.J. (1942). Statistical investigation of a sample survey for obtaining farm
facts. Iowa Agricultural Experimental Station Research Bulletin No. 304.
10. Kiaer, A.N. (1895-6). Observations et expériences concernant des dénombrements
représentatifs. Bull. Int. Stat. Inst., 9, Liv. 2, 176-183.
11. Narain, R.D. (1951). On Sampling without replacement with varying probabilities, J.
Ind. Soc. Agril. Statist., 3, 169-174.
12. Neyman, J. (1934). On the two different aspects of the representative method: The
method of stratified sampling and the method of purposive selection. J. Roy. Statist.
Soc., 97, 555-606.
13. Neyman, J. (1938). Contribution to the theory of sampling human populations. J.
Amer. Statist. Assoc., 33, 101-116.
14. Royall, R.M. (1968). An old approach to finite population sampling theory. J. Amer.
Statist. Assoc., 63, 1269-1279.
15. Royall, R.M. (1970). On finite population sampling theory under certain linear
regression models. Biometrika, 57, 377-387.
16. Sukhatme, P.V. (1959). Major developments in the theory and application of
sampling during the last twenty five years. Estadistica, 17, 62-679.