Training Manual on Sample Design for Surveys Draft 2006

Note: This is a draft training document developed by
the International Programs Center of the US Bureau
of the Census for training in developing countries.
CHAPTER 1
GENERAL NATURE OF SAMPLE SURVEYS
________________________________________________________________________________
1.1
ROLE OF SAMPLING IN STATISTICAL THEORY AND
METHODS
In a broad sense, sampling theory can be considered as coextensive with modern statistical methods.
Almost all of the modern developments in statistics relate to the inferences that can be made about a
population when information is available from only a sample of the elements of the population.
Some of the ways in which this is reflected in statistical programs are mentioned below.
1.1.1
Survey Work
In most survey work, the population consists of all persons (or housing units, households, industrial
establishments, farms, etc.) in a city or other area. Information is obtained or desired from a sample
of the population, but inferences are required on characteristics of the whole population.
1.1.2
Design and Analysis of Experiments
In the design and analysis of experiments, the population represents all possible applications of
several alternative techniques which can be used. For example, the experiment may be agricultural,
in which a number of fertilizers are being tested. The population is infinite because it represents the
use of the fertilizers in all possible farms over all time. The problem is to design experiments so that
the maximum amount of information can be made available for inferences about the full population,
estimated from a sample of limited size.
1.1.3
Quality Control
In the application of quality control methods in an industrial establishment, for example, the
population is all of the products coming out of a machine. Inferences are needed on how well the
products conform to specifications. The term "quality control" is also applied to a sample check on
the quality of field work done in a sample survey; the sample check is carried out after the actual
survey is completed. Office operations such as editing and coding are also subject to quality control;
a sample of the work is checked to determine if it meets acceptable standards.
1.2
CONTENTS OF CHAPTERS
These chapters will be limited to one aspect of sampling, namely its application in survey
work. They will deal mainly with principles of sampling from a common-sense rather than a
mathematical viewpoint, though mathematics cannot be entirely avoided. The emphasis will be on
the methods of sampling that can be used under different conditions. The formulas will be
presented, some without mathematical proof, but with information on how they should be used. Two
types of examples will be used to illustrate the formulas and methods: (a) simple examples to make
the techniques clear, and (b) examples taken from actual surveys to show the realistic applications of
the methods discussed.
First there will be a general discussion of the subject as a whole, including the nature of probability
sampling, and choices of sampling units and sampling frames. Then we shall describe the types of
common sample designs--simple random sampling, stratified sampling, and cluster sampling. The
features of these designs and the methods of sample selection will be discussed. The different
methods of estimating the characteristics of the population from the sample results will also be
treated, as well as how to determine the size of sample required for a particular degree of reliability
and how to calculate sampling errors.
We shall also discuss the problem of estimating, from a sample, the results that would have been
obtained from a full census using the same questionnaire, enumeration or interview procedures,
supervision, etc. These are aspects of the problem of sampling error. There are, of course,
nonsampling errors that arise from wrong responses to questions, or from poorly worded questions.
These are present in complete censuses as well as in sample surveys. Although these chapters are not
primarily concerned with such nonsampling errors, they may be very important. In fact,
nonsampling errors often represent more serious limitations on the use of statistics than sampling
errors.
1.3
REASONS FOR THE USE OF SAMPLES
There are six basic reasons for the use of samples:
(1)
A sample may save money (as compared with the cost of a complete census) when absolute
precision is not necessary.
(2)
A sample saves time, when data are desired more quickly than would be possible with a
complete census.
(3)
A sample may make it possible to concentrate attention on individual cases.
(4)
In industrial uses, some tests are destructive (for example, testing the length of time an
electric bulb will last) and can only be performed on a sample of items.
(5)
Some populations can be considered as infinite, and can, therefore, only be sampled. A
simple example is an agricultural experiment for testing fertilizers. In one sense, a census
can be considered as a sample at one instant of time of an underlying causal system which
has random features in it.
(6)
Where nonsampling errors are necessarily large, a sample may give better results than a
complete census because nonsampling errors are easier to control in smaller-scale
operations.
1.4
ILLUSTRATIONS OF SAMPLING
The following illustrate the use of sampling in various situations.
1.4.1
Limited Funds
The use of a sample survey when limited funds are available for collecting information is well
known. Sampling may also be used to save money in tabulation. For example, in the 1950 Census
in the United States most of the data were collected on a 100-percent basis. However, many
tabulations were made on a sample basis (20% or 3-1/3%) for special detailed classifications to save
the cost of tabulating 150,000,000 individual records. The 1960 Census utilized sampling
procedures to an even greater extent in both the collection and the tabulation of data.
1.4.2
Time Saving
Other examples from the 1950 census in the United States illustrate how samples can be used to save
time. The enumeration of the census was taken in April 1950. The time required for processing the
results was such that publication of the results was expected to start in 1951 and continue through
1952. A sample of the census results was selected for quick processing and tabulation, and
preliminary results were published on the basis of this sample. These results were issued 1 to 2 years
earlier than the complete census results.
1.4.3
Concentration on Particular Cases
Some surveys require such intensive and time-consuming interviews that it is impossible to consider
them on any basis except a sample basis. Moreover, the use of sampling permits particular attention
to be given to a limited number of cases. Examples are family budget studies and comprehensive
studies of health conditions.
1.4.4
Sampling for Time Series
Information may be required for a time series when data are available only for particular periods of
time and results are needed promptly. The series may be one of economic activity in the country,
with figures available only on a yearly or monthly basis, or it may be one of producing a learning
curve for which only occasional tests are possible.
1.4.5
Controlling Nonsampling Errors
An interesting example arose in the 1950 United States Census of a case where the relationship
between nonsampling and sampling errors made sample results preferable to complete census results.
The United States has conducted a monthly sample survey of the labor force since 1940. In 1950, it
was based on a sample of 20,000 households. The information obtained in the 1950 complete census
also included labor force status. When the results of the census became available, it was clear that
the figures for both unemployed and employed persons were quite
different from those estimated from the labor force sample survey; the differences were far beyond
what could be expected on the basis of the sampling errors. The problem of reporting in the census
introduced much greater error than the sampling error of the monthly survey (this greater error was
caused by the use of enumerators who, for the most part, were inexperienced in interviewing). Users
of census data were advised, therefore, to use the sample results as the more reliable national
statistics on the labor force.
1.5
LIMITATIONS OF SAMPLING
Under certain conditions, the usefulness of sampling becomes questionable. Three principal
conditions can be mentioned.
(1)
If data are needed for very small areas, disproportionately large samples are required, since
precision of a sample depends largely on the sample size and not on the sampling rate. In
this case, sampling may be almost as expensive as a complete census.
(2)
If data are needed at regular intervals of time, and it is important to measure very small
changes from one period to the next, very large samples may be necessary.
(3)
If there are unusually high overhead costs connected with a sample survey, caused by work
involved in sample selection, control, etc., sampling may be impractical. For example, in a
country with many small villages it may be more economical to enumerate all the
households in the sample villages than to enumerate a sample of households within the
sample villages. For office processing, however, a sample of the enumerated households
may be used to reduce the work and costs of producing tabulations.
CHAPTER 2
CRITERIA AND DEFINITIONS
_______________________________________________________________________________________________________________
2.1
CRITERIA FOR THE ACCEPTABILITY OF A SAMPLING
METHOD
It has been demonstrated repeatedly in practical applications that modern sampling methods can
provide data of known reliability on an efficient and economical basis. However, although a sample
includes only part of a population, it would be misleading to call a collection of numbers a "sample"
merely because it includes part of a population.
To be acceptable for statistical analysis, a sample must represent the population and must have
measurable reliability. In addition, the sampling plan should be practical and efficient.
2.1.1
Chance of Selection for Each Unit
The sample must be selected so that it properly represents the population that is to be covered. This
means that each unit (farm, household, person, or whatever unit is being sampled) must have a
nonzero probability (chance) of being selected.
2.1.2
Measurable Reliability
It should be possible to measure the reliability of the estimates made from the sample. That is, in
addition to the desired estimates of characteristics of the population (totals, averages, percentages,
etc.) the sample should give measures of the precision of these estimates. As we shall see later, these
measures of precision can be used to indicate the maximum error that may reasonably be expected in
the estimates, if the procedures are carried out as specified, and if the sample is moderately large.
The estimation of precision is not possible unless the selection is carried out so that the chance of
selection of each unit is known in advance and random sampling is used.
2.1.3
Feasibility
A third characteristic is that the sampling plan must be practical. It must be sufficiently simple and
straightforward so that it can be carried out substantially as planned; that is, the sampling theory and
practice will be the same. A plan for selecting a sample, no matter how attractive it may appear on
paper, is useful only to the extent that it can be carried out in practice. When the methods actually
followed are the same (or substantially the same) as specified in the sampling plan, then known
sampling theory provides the necessary measures of reliability. In addition, the measures of
reliability computed from the survey results will serve as powerful guides for future improvement in
important aspects of the sample design.
2.1.4
Economy and Efficiency
Finally, the design should be efficient. Among the various sampling methods that meet the three
criteria stated above, we would naturally choose the method which, to the best of our knowledge,
produces the most information at the smallest cost. Although this is not an essential feature of an
acceptable sampling plan, it is clearly a highly desirable one. It implies that the most effective
possible use will be made of all available facilities and resources, such as maps, other statistical data,
personal knowledge, sampling theory, etc.
We shall consider only sampling methods that conform to the above criteria. We shall present basic
theory for various alternative designs which are possible, and methods of measuring their precision.
We shall also stress practical methods of application and considerations of efficiency.
2.2
DEFINITIONS OF TERMS
2.2.1
Statistical Survey
The statistical survey is an investigation involving the collection of data. Observations or
measurements are taken on a sample of elements for making statistical inferences (see Glossary in
Annex A) about a defined group of elements. Surveys are conducted in many ways.
2.2.2
Unit of Analysis
The unit of analysis is the unit for which we wish to obtain statistical data. The most common units
of analysis are persons, households, farms, and business firms. They may also be products coming
out of some machine process. The unit of analysis is frequently called an element of the population.
There may be more than one unit of analysis in the same survey; for example, households and
persons; or number of farms and hectares (or acres) harvested.
2.2.3
Characteristic
A characteristic is a general term for any variable or attribute having different possible values for
different individual units of sampling or analysis. In a sample survey, we observe or measure the
values of one or more characteristics for the units in the sample. For example, we observe (or ask
about) the area of land for rice crop, the number of cattle on a farm, the age and sex of a person, the
number of children per family, etc. So, we observe a unit, but we measure several characteristics of
that unit.
2.2.4
Population or Universe
The population or universe is the entire group of all the units of analysis whose characteristics are
to be estimated. The chapters in this sampling manual will deal primarily with a finite population,
having N units.
2.2.5
Probability Sample
A probability sample is a sample obtained by application of the theory of probability. In probability
sampling, every element in a defined population has a known, nonzero, probability of being selected.
It should be possible to consider any element of the population and state its probability of selection.
2.2.6
Sampling with Replacement and Without Replacement
A simple way of obtaining a probability sample is to draw the units one by one with a known
probability of selection assigned to each unit of the population at the first and each subsequent draw.
The successive draws may be made with or without replacing the units selected in the preceding
draws. The former is called the procedure of sampling with replacement, and the latter, sampling
without replacement.
2.2.7
Simple Random Sampling
Simple random sampling is a special case of probability sampling, sometimes called unrestricted
random sampling. It is a process for selecting n sampling units one at a time, from a population of N
sampling units so that each sampling unit has an equal chance of being in the sample. Every possible
combination of n sampling units has the same chance of being chosen. Selection of one sampling
unit at a time with equal probability may be accomplished by either sampling with replacement or
without replacement. In practice almost all, if not all, samples are selected without replacement. Using a table of
random numbers to select the units satisfies this definition of simple random sampling.
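As a minimal sketch of the selection procedures described in 2.2.6 and 2.2.7, the following Python fragment draws a simple random sample with or without replacement. The function name simple_random_sample, the frame of 20 numbered units, and the seed are assumptions made for illustration; a pseudo-random number generator stands in for the table of random numbers.

```python
import random


def simple_random_sample(frame, n, replace=False, seed=None):
    """Draw n units from the frame, each draw giving equal chance to the eligible units."""
    rng = random.Random(seed)
    if replace:
        # With replacement: a unit is returned to the frame and may be drawn again.
        return [rng.choice(frame) for _ in range(n)]
    # Without replacement: every combination of n distinct units is equally likely.
    return rng.sample(frame, n)


frame = list(range(1, 21))  # hypothetical frame of N = 20 numbered units
srs = simple_random_sample(frame, 5, seed=2006)
print(srs)
```

Fixing the seed makes the selection reproducible, which is useful when the sample selection has to be documented or audited.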
2.2.8
Sampling Frame
The totality of the sampling units from which the sample is to be selected is called the sampling
frame. The frame may be a list of persons or of housing units; it may be a subdivided map, or it may
be a directory of names and addresses stored in some kind of electronic medium, such as a file in a
hard disk or a data base.
2.2.9
Parameter
A parameter is a quantity computed from all values in a population set. That is, a parameter is a
descriptive measure of a population. For example, consider a population consisting of N elements.
Then the population total, the population average or any other quantity computed from
measurements including all elements of the population is a parameter. The objective of sampling is
to estimate the parameters of a population.
2.2.10
Statistic
A statistic is a quantity computed from sample observations of a characteristic, usually for the
purpose of making an inference about the characteristic in the population. The characteristic may be
any variable which is associated with a member of the population, such as age, income, employment
status, etc.; the quantity may be a total, an average, a median, or other quantiles. It may also be a rate
of change, a percentage, a standard deviation, or it may be any other quantity whose value we wish to
estimate for the population.
Note that the term statistic refers to a sample estimate and the term parameter refers to a population
value.
Note on Quantiles: What is a quantile? If a set of data is arranged in order of magnitude, the
middle value (or the arithmetic mean of the two middle values) which divides the set into two equal
parts is the MEDIAN. By extending this idea we can think of those values which divide the set into
four equal parts. These values, denoted by Q1, Q2 and Q3 are called the first, second and third
quartiles respectively, the value of Q2 being equal to the median. Similarly the values which divide
the data into ten equal parts are called deciles and are denoted by D1, D2, ... D9, while the values
dividing the data into one hundred equal parts are called percentiles and are denoted by P1, P2, ... P99.
The 5th decile and the 50th percentile correspond to the median. The 25th and 75th percentiles
correspond to the first and third quartiles, respectively. Collectively, quartiles, deciles, percentiles
and other values obtained by equal subdivisions of the data are called quantiles.
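The relationships among quartiles, deciles, and percentiles can be checked with a short Python sketch. The data set below is hypothetical, and the standard-library function statistics.quantiles uses one particular interpolation convention; other conventions can give slightly different cut points on small data sets.

```python
import statistics

# Hypothetical data set, already arranged in order of magnitude.
data = [3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23]

q1, q2, q3 = statistics.quantiles(data, n=4)     # quartiles Q1, Q2, Q3
deciles = statistics.quantiles(data, n=10)       # deciles D1 ... D9
percentiles = statistics.quantiles(data, n=100)  # percentiles P1 ... P99

# Q2, D5 and P50 all coincide with the median.
print(q2 == statistics.median(data), deciles[4] == q2, percentiles[49] == q2)
# P25 and P75 coincide with the first and third quartiles.
print(percentiles[24] == q1, percentiles[74] == q3)
```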
2.2.11
Independent Information
Independent information consists of data that are known in advance of or simultaneously with the
survey which are not based on the survey but are used to improve the survey design. Such data may
be used for purposes of stratification, for determining the probabilities of selection, or in estimating
the final results from the sample data. The data must be of good, known quality.
2.2.12
Estimate and Estimator
An estimate is a numerical quantity computed from sample observations of a characteristic and
intended to provide information about an unknown population value.
An estimator is a mathematical formula or rule which uses sample results to produce an estimate for
the entire population. For example, the sample average,
$\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$,
is an estimator. It provides an estimate of the parameter, the population average,
$\bar{Y} = \frac{1}{N} \sum_{i=1}^{N} Y_i$.
That is, the sample average is an estimate of the population average.
Therefore, the estimator refers to a mathematical formula. When numbers are plugged into the
formula, an estimate is produced. However, in common statistical language, the words estimate and
estimator are used interchangeably.
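The distinction can be pictured in code, where the estimator is a function and the estimate is the number it returns. This is only an illustrative sketch; the function name sample_mean and the sample values are assumptions.

```python
def sample_mean(values):
    """Estimator: the rule 'add the sample values and divide by the sample size'."""
    return sum(values) / len(values)


sample = [3, 9]                 # hypothetical sample observations
estimate = sample_mean(sample)  # estimate: the number produced by the rule
print(estimate)                 # 6.0
```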
2.2.13
Probability of Selection
The probability of selection is the chance that each unit in the population has of being included in
the sample. Probability values range from 0 to 1, inclusive.
2.2.14
Random Variables
A random variable is a variable which, by chance, can be equal to any value in a specified set. The
probability that it equals any given value (or falls between two limits) is either known, can be
determined, or can be approximated or estimated. A chance mechanism determines the value which
a random variable takes. For example, in flipping a coin, we can define the random variable X which
can take the value 1 if the coin lands ‘heads’ and the value 0 if the coin lands ‘tails’. Therefore, the
variable X, as was just defined, can take either one of two values after the coin is flipped.
2.2.15
Probability Distribution
The probability distribution gives the probabilities associated with the values which a random
variable can equal. If there are N values that a random variable X can take, say X_1, X_2, ..., X_N, then
there are N probabilities associated with the X_i values, namely P_1, P_2, ..., P_N. The probabilities and
the values the random variable takes constitute the probability distribution of X.
2.2.16
Illustration
The 2000 U.S. Census of Population and Housing found that 281,421,906 persons lived in
105,480,101 households, of which 71,787,347 were family households and 33,692,754 were nonfamily households. Table 2.1 below shows the distribution of households by type (see footnote 1).
These data show that 68.1% of all households are of the “family” type and 31.9% are of the
“nonfamily” type. Now if we were to pick a household at random, what is the probability that we
would pick a family household? If each household, large or small, is equally likely to be picked,
then there is a .681 probability of picking a family household.
Footnote 1: The Census Bureau defines a household as persons who occupy a house, apartment, or other separate living quarters. One of the tests in
determining a household is that there are complete kitchen facilities for the exclusive use of the occupants. People who are not in households live in
group quarters including rest homes, rooming houses, military barracks, jails, and college dormitories.
Table 2.1
Type of U.S. Households, 2000

Type of Household                            Number of       Fraction of Total
                                             Households      Households (%)
Family Households                            71,787,347            68.1
   Married Couple                            54,493,232            51.7
   Female Householder, no husband present    12,900,103            12.2
   Male Householder, no wife present          4,394,012             4.2
Nonfamily Households                         33,692,754            31.9
   One Person                                27,230,075            25.8
   Two or More People                         6,462,679             6.1
Total Households                            105,480,101           100.0

Source: U.S. Census Bureau, 2000 Census of Population and Housing.
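The probability quoted in the illustration can be recomputed from the household counts. The sketch below treats household type as a random variable with two values and builds its probability distribution, assuming, as the text does, that every household is equally likely to be picked. The variable names are illustrative.

```python
# Household counts from the 2000 U.S. Census, as given in the text.
households = {
    "Family": 71_787_347,
    "Nonfamily": 33_692_754,
}

total = sum(households.values())

# Probability distribution of household type under equal-probability selection.
distribution = {kind: count / total for kind, count in households.items()}

for kind, prob in distribution.items():
    print(f"{kind}: {prob:.3f}")
```

The two probabilities round to .681 and .319 and add up to 1, as required of a probability distribution.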
Exercises
2.1
In order to select a sample of the total population of a city, a sample is selected from the
telephone directory for that city and the families of the persons selected are interviewed.
Does this satisfy the criteria for acceptability? Explain.
2.2
In order to determine the population of a city where all children of school age attend school,
a sample of school children is drawn and their families are interviewed. Give two reasons
why this does not meet the criteria for acceptability. (Think of families who have more than
one child in school and families that don’t have any children.)
2.3
Suppose that you were using sampling to estimate the total number of words in a book that
contains illustrations.
(a) Is there any problem of definition of the population?
(b) What are the pros and cons of (1) using the page, (2) the line as a sampling unit?
2.4
Suppose that you work for a major public opinion pollster and you wish to estimate the
proportion of adult citizens who think the President is doing a good job in heading the
nation's economy. Clearly define the population you wish to sample.
2.5
The problem of finding a frame that is complete and from where a sample can be drawn is
often an obstacle. What kinds of frames might be tried for the following surveys? Do the
frames have any serious weakness?
(a) A survey of stores that sell luggage in a large city.
(b) A survey of the kinds of articles left behind in subways or on buses.
(c) A survey of persons bitten by snakes during the last year.
(d) A survey to estimate the number of hours per week spent by family members watching
television.
CHAPTER 3
SIMPLE RANDOM SAMPLING
SAMPLING DISTRIBUTION
______________________________________________________________________________
3.1
INTRODUCTION
In this chapter, we shall introduce the concept of the sampling distribution of a statistic, probably the
most basic concept of statistical inference. We shall concentrate only on the sample mean and its
sampling distribution. We shall first introduce certain definitions and relationships of terms needed
for the sampling distribution.
3.2
EXPECTED VALUE
The expected value is the average value for a single characteristic over all possible samples.
Mathematically, we define the expected value (or mean) of a random variable Y as follows:
$E(Y) = \sum_{y} y \, P(y)$
where P(y) is the probability that Y takes the value y, and the Greek letter $\Sigma$ (sigma) is used to indicate the sum of the products of all
possible values of y and their associated probabilities P(y). The small y denotes a particular value of
Y.
The expected value is a weighted average of the possible outcomes, with the probability weights
reflecting the likelihood of occurrence of each outcome. Thus, the expected value should be
interpreted as the long-run average value of Y, if the frequency with which each outcome occurs is in
accordance with its probability.
For example, consider the tossing of a die in which each outcome (numbers 1 to 6) has the same
probability of occurring, 1/6 (assuming the die is not biased). If Y is used to represent the number
that appears when we throw a die, the expected value of Y is given by:
E(Y) = 1 (1/6) + 2 (1/6) + 3 (1/6) + 4 (1/6) + 5 (1/6) + 6 (1/6) = 3.5
The expected value of Y is not the most likely or the most typical value of Y. It is the long-run
average value of Y, if we repeatedly perform the experiment that originates the outcomes. Some
throws of the die will produce numbers below 3.5 and others above 3.5. The average of these
different numbers, in the long run, will be 3.5.
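The die calculation can be reproduced with a few lines of Python. This is a sketch only; the function name expected_value is an assumption, and exact fractions are used so that the result is not affected by rounding.

```python
from fractions import Fraction


def expected_value(distribution):
    """Sum of each possible value multiplied by its probability."""
    return sum(y * p for y, p in distribution.items())


# Fair die: each face 1..6 has probability 1/6.
die = {face: Fraction(1, 6) for face in range(1, 7)}

print(expected_value(die))  # 7/2, that is, 3.5
```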
3.2.1
Unbiased Estimate
An unbiased estimate has the property that the average of all the estimates obtained from all possible
samples of a given size is equal to the true value. Mathematically, an estimate is unbiased if the
expected value of the estimate is equal to the parameter being estimated.
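For example, if $\hat{\theta}$ is an estimate of the parameter $\theta$ and if $E(\hat{\theta}) = \theta$, then $\hat{\theta}$ is an unbiased
estimate of $\theta$. Otherwise, $\hat{\theta}$ is biased, with bias equal to $E(\hat{\theta}) - \theta$. That is, the bias is the difference between the
expected value of an estimate and the true population value (parameter) being estimated.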
3.2.2
Consistent Estimate
An estimate is consistent if its values tend to concentrate increasingly around the true value as the
sample size increases. In other words, the estimate assumes the population value with probability
approaching unity as the sample size tends to infinity. This definition of consistency strictly applies
to estimates based on samples drawn from an infinite population. We use the following definition in
the case of a finite population. An estimate $\hat{Y}$ is said to be a consistent estimate of the parameter Y
if it takes the population value when n = N.
In the next section we will see that for simple random sampling the sample mean is an unbiased and
consistent estimate of the population mean as the sample size increases.
3.3
SAMPLING DISTRIBUTION
A sampling distribution is the probability distribution of all possible values that an estimate might
take under a specified sampling plan.
In this section we will show by examples that the sample average (mean) is both an unbiased and a
consistent estimate of the true population average.
Let us first present the idea of a sampling distribution of the mean by actually listing all possible
random samples of size n = 2 which can be drawn from a hypothetical population of N = 5 housing
units (HUs) shown in Table 3.1. We wish to estimate the average household (HH) size of these HUs
from a sample.
Table 3.1
Household Size per Household

HU          1     2     3     4     5
HH Size     3     5     7     9    11
The total number of persons in the population is:
$Y = \sum_{i=1}^{N} Y_i = 3 + 5 + 7 + 9 + 11 = 35$
The average number of persons per household (or average household size) is:
$\bar{Y} = Y / N = 35 / 5 = 7$
If we take a sample of size 2 from this population, there are
$\binom{5}{2} = \frac{5!}{2! \, 3!} = 10$
possibilities, and they are:
3 and 5,  3 and 7,  3 and 9,  3 and 11,
5 and 7,  5 and 9,  5 and 11,
7 and 9,  7 and 11,
9 and 11
The means of these samples are 4, 5, 6, 7, 6, 7, 8, 8, 9, and 10, respectively. If sampling is
random, so that each sample has probability 1/10, the possible samples of two
HUs from the population of 5 HUs and their means are as shown in Table 3.2. Table 3.3 presents the sampling
distribution of the mean.
Table 3.2
Samples of Two HUs from a Population of 5 HUs.

SAMPLES OF SIZE     VALUE OF       PROBABILITY
n = 2               $\bar{y}$      p($\bar{y}$)
3, 5                    4             1/10
3, 7                    5             1/10
3, 9                    6             1/10
3, 11                   7             1/10
5, 7                    6             1/10
5, 9                    7             1/10
5, 11                   8             1/10
7, 9                    8             1/10
7, 11                   9             1/10
9, 11                  10             1/10
Table 3.3
Sampling Distribution of the Mean

Mean      Probability
  4          1/10
  5          1/10
  6          2/10
  7          2/10
  8          2/10
  9          1/10
 10          1/10
An examination of this sampling distribution reveals some pertinent information relative to the
problem of estimating the mean of the given population using a random sample of size 2. For
instance, we see that corresponding to $\bar{y}$ = 6, 7, or 8, the probability is 6/10 that a sample mean will
not differ from the population mean (which is 7) by more than 1, and that corresponding to $\bar{y}$ = 5, 6,
7, 8, or 9, the probability is 8/10 that a sample mean will not differ from the population mean by
more than 2.
Further useful information about this sampling distribution of the mean can be obtained by
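calculating its expected value as follows:
$E(\bar{y}) = 4(1/10) + 5(1/10) + 6(2/10) + 7(2/10) + 8(2/10) + 9(1/10) + 10(1/10) = 7$
We may also use Table 3.2 to compute the expected value of $\bar{y}$:
$E(\bar{y}) = (4 + 5 + 6 + 7 + 6 + 7 + 8 + 8 + 9 + 10)(1/10) = 7$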
Note that the same results would be obtained for samples of any size. Recall the definition of the
expected value, which is the average of a single characteristic over all possible samples.
With simple random sampling the sample mean is an unbiased estimate of the true mean.
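The enumeration of all possible samples and the resulting sampling distribution of the mean can be reproduced with a short Python sketch. It lists every sample of n = 2 from the five household sizes of Table 3.1, gives each sample probability 1/10, and computes the expected value of the sample mean. Variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

hh_sizes = [3, 5, 7, 9, 11]  # household sizes from Table 3.1
n = 2                        # sample size

# All possible samples of n units drawn without replacement.
samples = list(combinations(hh_sizes, n))
means = [Fraction(sum(s), n) for s in samples]

# Each sample is equally likely under simple random sampling.
p = Fraction(1, len(samples))
sampling_distribution = {m: c * p for m, c in sorted(Counter(means).items())}

# Expected value of the sample mean over all possible samples.
expected_mean = sum(m * prob for m, prob in sampling_distribution.items())

for m, prob in sampling_distribution.items():
    print(m, prob)
print("E(ybar) =", expected_mean)  # equals the population mean, 7
```

Changing n to any value from 1 to 5 leaves the expected value at 7, which is the sense in which the sample mean is unbiased.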
We will now compare the distribution of the sample estimates to show that:
(1)
As the sample size increases, the means of the samples tend to concentrate more and more
around the true average value. In other words, the estimates tend to become more and more
reliable as the sample size increases.
(2)
The percentage distributions of the sample estimates can be used to predict the chance of
obtaining a sample estimate within specified ranges of the true value.
To illustrate these statements, consider a hypothetical population of 12 individuals. We wish to make
estimates from samples of 1, 2, 3, 4, 5, 6, and 7 individuals. The full population is shown in
Table 3.4 below.
Table 3.4
Income of Hypothetical Population of 12 Persons

Individual     Income        Individual     Income
    1          $1,300            7          $1,800
    2          $6,300            8          $2,700
    3          $3,100            9          $1,500
    4          $2,000           10            $900
    5          $3,600           11          $4,800
    6          $2,200           12          $1,900

TOTAL INCOME: $32,100
AVERAGE INCOME: $2,675
A frequency distribution of the sample means is illustrated in Table 3.5 for samples of sizes
1, 2, 3, 4, 5, 6, and 7 individuals. For each sample size, the percentage of the sample estimates falling
within a specified range of the true value and the average of the means are also shown in the table.
For example, the proportion of the sample results falling between $2,000 and $3,400 is 47% for
samples of 2; 58% for samples of 3; 69% for samples of 4; and 78%, 87%, and 94% for samples of
5, 6, and 7, respectively. This tells us that by taking samples large enough, the proportion of the
sample estimates falling within a designated interval about the expected value can be made as close
to 100% as desired. That is, we can predict the precision of a sample if we have the distribution of
all sample estimates of a given size for the population. The increasing concentration of sample
estimates around the true value illustrates consistency, a quality possessed by important types of
sample estimates.
Table 3.5
All Possible Estimates of Average Income
from Samples Drawn Without Replacement
from the Population of 12 Persons

Average Income               Number of samples having indicated estimate of average income
Estimated from Sample        with sample of size n
                             n=1     n=2     n=3     n=4     n=5     n=6     n=7
$  800 to $1,199               1       1       -       -       -       -       -
$1,200 to $1,399               1       2       3       1       -       -       -
$1,400 to $1,599               1       5      10      11       7       1       -
$1,600 to $1,799               -       6      15      25      25      16       6
$1,800 to $1,999               2       5      20      42      55      50      27
$2,000 to $2,199               1       6      22      50      78      84      61
$2,200 to $2,399               1       6      22      52      90     109      98
$2,400 to $2,599               -       6      19      52     101     139     136
$2,600 to $2,799               1       3      17      49     108     151     150
$2,800 to $2,999               -       4      16      57     101     133     130
$3,000 to $3,199               1       3      16      46      81     107     108
$3,200 to $3,399               -       3      16      38      61      79      62
$3,400 to $3,599               -       2      13      26      46      43      14
$3,600 to $3,799               1       2      10      21      27      12       -
$3,800 to $3,999               -       3       7      11      10       -       -
$4,000 to $4,199               -       3       4      10       2       -       -
$4,200 to $4,399               -       2       6       3       -       -       -
$4,400 to $4,599               -       1       1       1       -       -       -
$4,600 to $4,799               -       1       2       -       -       -       -
$4,800 to $6,399               2       2       1       -       -       -       -
Number of Samples             12      66     220     495     792     924     792
Average of all possible
samples*                  $2,675  $2,675  $2,675  $2,675  $2,675  $2,675  $2,675

* Expected Value
This means that if the sample is sufficiently large, one takes very little risk in using sample
estimates. (From the above illustration, it might appear that the increase in concentration arises
from the fact that, as the size of the sample increases, the percentage of the population in the sample
becomes higher. Actually, similar results would be observed when the size of sample increases even
though only a small proportion of the universe is included.)
3.4 PREDICTING RELIABILITY OF SAMPLE ESTIMATES
(CONFIDENCE INTERVAL)
We have seen that the precision of a sample can be predicted if we have the distribution of all
sample estimates of a given size for the population. In a real situation, we cannot select all possible
samples and examine the estimates derived from them. We must depend upon a single sample.
Therefore, it is necessary to find some measure of the extent to which the estimates made from
various samples differ from the true value; this measure, if it is to be useful, must be one that can be
estimated from the sample itself. Before showing how and why we can do this, we shall introduce
certain definitions and relationships which are derived from the theory of sampling.
3.4.1 Standard Deviation
We shall show that there is a measure of the variability in the original population which can be
estimated from the observations in a single sample, and from which it is possible to estimate the
expected error in the sample mean.
The measure of variability in the population is called the standard deviation; its square is called
the population variance and is designated by the symbol $\sigma^2$ or VAR. The variance of the
population is defined as the average of the squares of the deviations of all the individual
observations from their mean value. Thus, it would be computed by the following process, if all the
values in the universe could be observed:
$$\sigma^2 = \frac{(Y_1 - \bar{Y})^2 + (Y_2 - \bar{Y})^2 + \cdots + (Y_N - \bar{Y})^2}{N} = \frac{1}{N}\sum_{i=1}^{N}(Y_i - \bar{Y})^2$$
where the Y's with subscripts are individual observations and $\bar{Y}$ is the mean of the N observations
for the N elements in the universe. Note that it has become fairly general practice to denote the
population variance by $\sigma^2$ when dividing by N, and by $S^2$ when dividing by N-1; symbolically,
$$S^2 = \frac{1}{N-1}\sum_{i=1}^{N}(Y_i - \bar{Y})^2$$
Its sample equivalent is given by:
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2$$
where n is the sample size, $y_i$ is the sample measurement of a characteristic and $\bar{y}$ is the sample
mean.
We will use $S^2$ throughout the text because $s^2$ is an unbiased estimate of $S^2$. Note that all results are
equivalent in either notation. Also,
$$S^2 = \frac{N}{N-1}\,\sigma^2$$
3.4.2 Sampling Error of Sample Means
The variance of the sample means is the average of the squares of the deviations of the means of all
possible samples of size n from the true mean. The variance of $\bar{y}$ is denoted by $S_{\bar{y}}^2$ and we
write:
$$S_{\bar{y}}^2 = \frac{N-n}{N}\cdot\frac{S^2}{n} = (1-f)\,\frac{S^2}{n}$$
where f = (n/N) = sampling fraction.
The square root of the variance of $\bar{y}$ is called the sampling error for means of samples of size n.
The sampling error of $\bar{y}$ is:
$$S_{\bar{y}} = \sqrt{\frac{N-n}{N}}\cdot\frac{S}{\sqrt{n}} = \sqrt{1-f}\,\frac{S}{\sqrt{n}}$$
It is important to note that the sampling error varies with the size of the sample, as we would
expect. If we compute the sampling error for all possible samples of sizes shown in Table 3.5, we
see that as the sample size increases, the sampling error becomes smaller and smaller. This is
shown in the following illustration (see Table 3.6). The factor (N - n)/N in the formula for the
variance of $\bar{y}$ is called the finite population correction factor (fpc). As a rule of thumb, if
$n \le 0.05N$ we can ignore (N - n)/N since its value will be close to 1. Otherwise we should include it in
the formula in order not to overestimate the variance of $\bar{y}$.
3.4.3 Illustration
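Consider again the population of 12 individuals in Table 3.4. In this case, the true average is
$\bar{Y} = \$2{,}675$ with N = 12. We compute $S^2$ as follows:
$$S^2 = \frac{1}{N-1}\sum_{i=1}^{N}(Y_i - \bar{Y})^2 = \frac{(1{,}300 - 2{,}675)^2 + (6{,}300 - 2{,}675)^2 + \cdots + (1{,}900 - 2{,}675)^2}{11} = \frac{27{,}162{,}500}{11} = 2{,}469{,}318.18$$
and S = $1,571.41.
An easier way to calculate $S^2$ is as follows:
$$S^2 = \frac{\sum_{i=1}^{N} Y_i^2 - N\bar{Y}^2}{N-1} = \frac{113{,}030{,}000 - 12\,(2{,}675)^2}{11} = \frac{27{,}162{,}500}{11} = 2{,}469{,}318.18$$
Using S, we can compute the sampling error of the sample mean for different sample sizes n. For
example, if the sample size n = 1 then,
$$S_{\bar{y}} = \sqrt{\frac{12-1}{12}}\cdot\frac{1{,}571.41}{\sqrt{1}} \approx \$1{,}505$$
for n = 2,
$$S_{\bar{y}} = \sqrt{\frac{12-2}{12}}\cdot\frac{1{,}571.41}{\sqrt{2}} \approx \$1{,}015$$
The sampling errors for all possible sample sizes are given in the following table.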
Table 3.6
Sampling Error of Estimates of Average Income for Various Sample Sizes

Size of Sample    Sampling Error of Estimated Measure
      1                      $1,505
      2                       1,015
      3                         786
      4                         642
      5                         537
      6                         454
      7                         383
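The figures in Table 3.6 can be recomputed in two ways: from the formula with the finite population correction, and directly as the root mean squared deviation of all possible sample means from the true mean. The sketch below assumes the incomes of Table 3.4; the function names are hypothetical. Both routes agree, and the n = 2 value comes out at roughly $1,014, about a dollar below the rounded figure printed in the table.

```python
import math
from itertools import combinations

# Incomes of the 12 persons in Table 3.4.
INCOMES = [1300, 6300, 3100, 2000, 3600, 2200,
           1800, 2700, 1500, 900, 4800, 1900]
N = len(INCOMES)
MEAN = sum(INCOMES) / N
# Population variance with divisor N - 1 (the S^2 convention of the text).
S2 = sum((y - MEAN) ** 2 for y in INCOMES) / (N - 1)
S = math.sqrt(S2)


def sampling_error(n):
    """Formula route: sqrt((N - n) / N) * S / sqrt(n)."""
    return math.sqrt((N - n) / N) * S / math.sqrt(n)


def sampling_error_by_enumeration(n):
    """Brute-force route: spread of the means of all possible samples of size n."""
    means = [sum(sample) / n for sample in combinations(INCOMES, n)]
    return math.sqrt(sum((m - MEAN) ** 2 for m in means) / len(means))


print(f"S = {S:.2f}")
for n in range(1, 8):
    print(n, round(sampling_error(n), 1), round(sampling_error_by_enumeration(n), 1))
```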
3.4.4 Interval Estimate (Confidence Interval)
We know that the probability of an estimate being equal to the true value (parameter) is zero for
continuous variables. Thus, it will be more useful if we can state how probable it is that an interval
based on our estimate will contain the parameter to be estimated.
Interval estimator - An interval estimator is a formula that tells us how to use the sample
observations to calculate two numbers that define an interval which will enclose the estimated
parameter with a certain (usually high) probability. The resulting interval is called a confidence
interval and the probability that it contains the true parameter is called its confidence
coefficient. If a confidence interval has a confidence coefficient equal to .95, we call it a 95%
confidence interval.
In general, the confidence interval for a parameter $\theta$ is given by
$$\hat{\theta} \pm t\,S_{\hat{\theta}}$$
where $\hat{\theta}$ is the sample estimate of $\theta$ and $S_{\hat{\theta}}$ is its sampling error.
The symbol t is the value of the normal deviate corresponding to the desired confidence probability.
In practice, $S^2$ is not known. Usually, $s^2$, the sample variance, is calculated from the sample data and
used as an estimate of $S^2$. If n is large, s provides a fairly good estimate of S; however, for small
samples this may not be the case. Using s, the confidence interval is
$$\hat{\theta} \pm t\,s_{\hat{\theta}}$$
For the parameter $\bar{Y}$ the confidence interval is:
$$\bar{y} \pm t\,\sqrt{\frac{N-n}{N}}\cdot\frac{s}{\sqrt{n}}$$
(Ignore the fpc if $n \le 0.05N$.)
The value t depends on the level of confidence desired. For large samples, the most common
values (see Appendix I - Normal Distribution Table) are:
t = 1.28 for 80% confidence level
t = 1.645 for 90% confidence level
t = 1.96 for 95% confidence level
t = 2.58 for 99% confidence level.
If the sample size is less than 30, the percentage points may be taken from the Student's t table (see
Appendix II) with (n-1) degrees of freedom.
3.4.5 Approach to Normal Distribution
Comparing Tables 3.5 and 3.6, it can be seen that as the sample size increases, the sample estimates
differ less and less from the expected value, and at the same time the sampling error becomes
smaller and smaller. In practical sampling problems, where a reasonably large sample is used
(generally 30 or more cases), the distribution of sample results over all possible samples
approximates very closely the normal distribution, the familiar bell-shaped curve. This is the
result of the most important theorem in statistics, the Central Limit Theorem, which states, briefly,
that the sum (or mean) of a large number of independent random variables is approximately normally
distributed.
For this distribution, the probabilities of being within a fixed range of the average value are well
known and have been published (see Appendix I). These probabilities depend solely on the value of
the sampling error. For example, the probability of being within one sampling error is about 68 percent;
for two sampling errors, it is about 95 percent; for three sampling errors, it is about 99.7 percent.
The implications are of fundamental importance to sampling theory. Suppose we have drawn a
simple random sample from a population, have computed the mean from the sample, $\bar{y}$, and have
estimated the true sampling error of the mean, $S_{\bar{y}}$, by means of $s_{\bar{y}}$. How can we infer the
precision of this particular sample result? If we set an interval based on $s_{\bar{y}}$ around the sample
estimate $\bar{y}$, we can be fairly confident that $\bar{y} \pm s_{\bar{y}}$ will give an interval such that one will be
correct about two-thirds of the time in asserting that the interval covers the true mean. Similarly,
$\bar{y} \pm 2s_{\bar{y}}$ gives a confidence interval for which the assertion will be correct 95 percent of the time,
and for $\bar{y} \pm 3s_{\bar{y}}$ it will be correct 99.7 percent of the time. To understand the concept, we present the
following illustration.
3.4.6 Illustration
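Consider again the same population of 12 individuals in Table 3.4. Let us find the percent of
sample averages in Table 3.5 which differ from the population average $\bar{Y}$ by less
than $S_{\bar{y}}$, less than $2S_{\bar{y}}$, and less than $3S_{\bar{y}}$. (We are using capital S instead of small s, as
well as $\bar{Y}$, because we are dealing with a population and we therefore know its true variance and its
true mean.) This is the same as finding the percent of sample averages which fall within
$\bar{Y} \pm S_{\bar{y}}$, $\bar{Y} \pm 2S_{\bar{y}}$, and $\bar{Y} \pm 3S_{\bar{y}}$, with $\bar{Y} = \$2{,}675$.
Consider a sample of size 2. Using $S_{\bar{y}} = \$1{,}015$ from Table 3.6, we have:
$$\bar{Y} \pm S_{\bar{y}} = 2{,}675 \pm 1{,}015 = (1{,}660,\ 3{,}690)$$
$$\bar{Y} \pm 2S_{\bar{y}} = 2{,}675 \pm 2{,}030 = (645,\ 4{,}705)$$
$$\bar{Y} \pm 3S_{\bar{y}} = 2{,}675 \pm 3{,}045 = (-370,\ 5{,}720)$$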
Table 3.5 shows that there are 42 sample averages that fall within the confidence interval (1660,
3690). That is, 63.6% of sample averages differ from the population average by less than one
sampling error. Similarly, there are 64 averages that fall within the confidence interval (645, 4705);
that is, about 97% of sample averages differ from the population average by less than two sampling
errors. It can easily be seen that 100% of sample averages differ from the population average by
less than three sampling errors. For the normal distribution, we have seen that the probability of
being within one standard (or sampling) error is 68%; for two standard errors, it is 95%; for three
standard errors it is 99.7%. This shows that even for small samples of size 2, the distribution of
sample results over all possible samples approximates very closely the normal distribution. For
larger samples, the results would conform to the normal distribution much more closely. The
percentages of sample averages in Table 3.5 which differ from the population average by less
than $S_{\bar{y}}$, $2S_{\bar{y}}$, and $3S_{\bar{y}}$ are displayed in Table 3.7.
Table 3.7
Concentration of Sample Results Around the Population Average

Sample of    Sampling error      Percent of sample averages in Table 3.5 differing
Size n       $S_{\bar{y}}$       from the population average by
                                 less than           less than            less than
                                 $S_{\bar{y}}$       $2S_{\bar{y}}$       $3S_{\bar{y}}$
   1           $1,505               75                  92                  100
   2            1,015               64                  97                  100
   3              786               65                  96                  100
   4              642               64                  97                  100
   5              537               65                  97                  100
   6              454               64                  97                  100
   7              383               65                  97                  100

NORMAL DISTRIBUTION                 68                  95                  99.7
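The percentages in Table 3.7 can be reproduced by counting, for each sample size, how many of the possible sample means lie within one, two and three sampling errors of the population average. This is a sketch under the assumption that the sampling error is computed from the formula with the fpc rather than read from the rounded Table 3.6; `coverage` is a hypothetical helper name.

```python
import math
from itertools import combinations

# Incomes of the 12 persons in Table 3.4.
INCOMES = [1300, 6300, 3100, 2000, 3600, 2200,
           1800, 2700, 1500, 900, 4800, 1900]
N = len(INCOMES)
MEAN = sum(INCOMES) / N
S = math.sqrt(sum((y - MEAN) ** 2 for y in INCOMES) / (N - 1))


def coverage(n):
    """Return ([counts within 1, 2, 3 sampling errors], number of possible samples)."""
    se = math.sqrt((N - n) / N) * S / math.sqrt(n)
    means = [sum(sample) / n for sample in combinations(INCOMES, n)]
    counts = [sum(1 for m in means if abs(m - MEAN) < k * se) for k in (1, 2, 3)]
    return counts, len(means)


for n in range(1, 8):
    counts, total = coverage(n)
    percents = [round(100 * c / total) for c in counts]
    print(f"n={n}: {percents} percent of {total} samples")
```

For samples of size 2 the counts are 42, 64 and 66 out of 66, matching the figures quoted in the text.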
Consider the distribution given in Table 3.5 of average income in all possible samples of size 7. A
graph of this distribution is shown in Figure 3.1. This figure appears approximately symmetric,
with a clustering of measurements about the midpoint of the distribution, tailing off rapidly as we
move away from the center of the histogram. Thus, the graph possesses the following properties:
Figure 3.1
DISTRIBUTION OF AVERAGE INCOME IN ALL POSSIBLE SAMPLES OF SIZE 7
(1) The sampling distribution of $\bar{y}$ appears approximately normally distributed when the sample
size is large.
(2) The average of all possible sample averages equals the population average.
(3) The variance of the sampling distribution is equal to $S_{\bar{y}}^2 = (1-f)\,S^2/n$, which is less than the
population variance, $S^2$.
Property (1) above is the result of the Central Limit Theorem (CLT), one of the most fundamental
and important theorems in statistics. Briefly stated, the CLT shows that if x1, x2, ... , xn are
independent random variables having the same distribution with mean $\mu$ and variance $\sigma^2$, then for a
large enough sample, the variable
$$z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$$
where $\bar{x}$ is the mean of the n variables, has approximately a standard normal distribution (i.e., mean
zero and variance one).
3.4.7 Illustration
Unoccupied seats on flights cause the airlines to lose revenue. Suppose a large airline wants to
estimate the average number of unoccupied seats per flight over the past year. To accomplish this,
the records of 225 flights are randomly selected from the files, and the number of unoccupied seats
is noted for each of the sampled flights.
The sample mean and standard deviation are
$\bar{y}$ = 11.6 seats and s = 4.1 seats
Estimate the mean number of unoccupied seats per flight during the past year, using a 90%
confidence interval (ignore the fpc).
The 90% confidence interval is,
$$\bar{y} \pm 1.645\,\frac{s}{\sqrt{n}} = 11.6 \pm 1.645\,\frac{4.1}{\sqrt{225}} = 11.6 \pm 0.45$$
that is, at the 90% confidence level, we estimate the mean number of unoccupied seats per flight to
be between 11.15 and 12.05 during the sampled year.
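The same calculation can be scripted. The sketch below is one possible implementation, not a prescribed procedure: the function name `mean_confidence_interval` is hypothetical, the normal deviate is taken from Python's standard-library `statistics.NormalDist` instead of Appendix I, and the fpc is applied only when a population size N is supplied.

```python
import math
from statistics import NormalDist


def mean_confidence_interval(ybar, s, n, confidence, N=None):
    """Large-sample confidence interval for a mean under simple random sampling."""
    # Two-sided normal deviate, e.g. about 1.645 for 90% confidence.
    t = NormalDist().inv_cdf(0.5 + confidence / 2)
    # Finite population correction; taken as 1 when N is not given.
    fpc = 1.0 if N is None else (N - n) / N
    half_width = t * math.sqrt(fpc) * s / math.sqrt(n)
    return ybar - half_width, ybar + half_width


# Airline illustration: 225 flights, mean 11.6 and s = 4.1 unoccupied seats.
low, high = mean_confidence_interval(11.6, 4.1, 225, 0.90)
print(f"90% confidence interval: {low:.2f} to {high:.2f}")
```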
3.4.8 Sampling and Nonsampling Errors
Estimates are subject to both sampling errors and nonsampling errors. Sampling error arises
because information is not collected from the entire target population, but rather from some portion
of it. Through the use of scientific sampling procedures, however, it is possible to estimate from the
sample data the range within which the true population value (parameter) is likely to be with a
known probability.
Nonsampling error, on the other hand, is defined as a residual category consisting of all other errors
which are not the result of the data having been collected from only a sample. These include errors
made by respondents, enumerators, supervisors, office clerical staff, key coding operators, etc.
3.4.9 Total Error (Mean Square Error)
The total error is the sum of all errors about a sample estimate, both sampling and nonsampling,
both variable and systematic. An illustration of the composition of the total error follows:
Total Error
    Sampling Error
        Variable Error
        Bias
    Nonsampling Error
        Variable Error
        Bias
In practice, the bulk of sampling error consists of variable error, and by contrast the bulk of
nonsampling error is bias.
Mathematically, the total error is represented by the mean square error. In terms of expected values,
the mean square error of the estimate $\hat{\theta}$ is denoted by $MSE(\hat{\theta})$ and is given by:
$$MSE(\hat{\theta}) = E\left[(\hat{\theta} - \theta)^2\right]$$
which is the average of the squares of deviations of all possible estimates from the parameter.
Recall that $MSE(\hat{\theta}) = Var(\hat{\theta}) + [Bias(\hat{\theta})]^2$. If the estimates are unbiased, the mean square
error is equivalent to the variance.
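The link between the mean square error, the variance and the bias can be made explicit by adding and subtracting the expected value of the estimate. The following is the standard derivation, written with $\hat{\theta}$ for the estimate and $\theta$ for the parameter (symbols chosen here for illustration):

```latex
\begin{aligned}
MSE(\hat{\theta})
  &= E\big[(\hat{\theta} - \theta)^2\big] \\
  &= E\big[\big((\hat{\theta} - E\hat{\theta}) + (E\hat{\theta} - \theta)\big)^2\big] \\
  &= E\big[(\hat{\theta} - E\hat{\theta})^2\big]
     + 2\,(E\hat{\theta} - \theta)\,E\big[\hat{\theta} - E\hat{\theta}\big]
     + (E\hat{\theta} - \theta)^2 \\
  &= Var(\hat{\theta}) + \big[Bias(\hat{\theta})\big]^2
\end{aligned}
```

The middle term vanishes because $E[\hat{\theta} - E\hat{\theta}] = 0$, so when the bias $E\hat{\theta} - \theta$ is zero the mean square error reduces to the variance.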
Exercises
3.1
Assume that you know the distribution of the number of cows in a population of eight farms,
as follows:

Farm              1   2   3   4   5   6   7   8
Number of Cows    4   5   0   3   2   1   1   0

a. Calculate the true average number of cows per farm.
b. Calculate the true standard deviation and variance of the number of cows per farm.
c. Take all possible samples of two farms each and calculate the average number of cows per
farm for each sample.
d. Compute the average of the 28 means obtained in c. and compare it with the true mean.
e. Compute the true sampling error $S_{\bar{y}}$ for means of samples of 2 farms.
f. Find the proportions of the 28 values of $\bar{y}$ that lie within $\bar{Y} \pm S_{\bar{y}}$ and within
$\bar{Y} \pm 2S_{\bar{y}}$. How do they compare with the expected proportions assuming the sampling
distribution of $\bar{y}$ is normal?
3.2
Consider the following distribution of N = 6 population values which represent "the number of
household persons residing in the housing unit." Random samples of size 2 are drawn from
this population.

Housing Unit (HU)    Household Size (HH)
       1                     5
       2                     6
       3                     7
       4                     8
       5                     9
       6                    10

a. Show that the mean of this population is $\bar{Y} = 7.5$ and its standard deviation (computed with
divisor N - 1) is $S = \sqrt{3.5} \approx 1.87$.
b. How many possible random samples of size 2 can be drawn from this population? List
them all and calculate their means.
c. Use the results obtained in b. to assign to each possible sample a probability and construct
the sampling distribution of the mean for random samples of size 2 from the given
population.
d. Calculate the mean and the standard deviation of the probability distribution obtained in c.
3.3
A simple random sample of 100 households is selected from a village in Nigeria. For
this sample, the average amount spent on electricity is $\bar{y}$ = 75 Naira per month and s = 15 Naira.
Find a 95% confidence interval for $\bar{Y}$. Interpret the interval (ignore the fpc).
3.4
A manufacturing company wishes to estimate the mean number of hours per month an
employee is absent from work. The company decides to randomly sample 320 of its
employees from a total of 5,000 employees and monitor their working time for 1 month. At
the end of the month the total number of hours absent from work is recorded for each
employee. If the mean and standard deviation of the sample are
hours and s = 6.4
hours, find a 95% confidence interval for the true mean number of hours absent per month per
employee.
CHAPTER 4
BASIC THEORY
________________________________________________________________________________
4.1
The simplest method of probability sampling is simple random sampling (SRS).
To introduce the idea of a simple random sample, let us ask the following questions:
(1) How many distinct samples of size n can be drawn from a population of size N?
(2) How can we define a simple random sample?
(3) How can a random sample be drawn in actual practice?
To answer the first question, we use combinatorics, which allows us to choose n objects out of a
total of N in
$$\binom{N}{n} = \frac{N!}{n!\,(N-n)!}$$
ways, where N! = N (N-1) (N-2) ... (3) (2) (1). For instance,
$$\binom{5}{2} = \frac{5!}{2!\,3!} = 10$$
different samples of size n = 2 can be drawn from a population of size N = 5.
To answer the second question, we make use of the answer to the first one and define a simple
random sample of size n (or more briefly, a random sample) selected from a population of size N as a
sample which is chosen in such a way that each of the $\binom{N}{n}$ possible samples has the same
probability of being selected. This probability is equal to:
$$\frac{1}{\binom{N}{n}} = \frac{n!\,(N-n)!}{N!}$$
For example, if a population consists of the N = 5 elements A, B, C, D and E (which might be the
incomes of five persons, the number of persons in five households, and so on), there are
$\binom{5}{3} = 10$ possible distinct samples of size n = 3; they consist of the elements ABC, ABD, ABE, ACD, ACE,
ADE, BCD, BCE, BDE, and CDE. If we choose one of these samples in such a way that each has
the probability 1/10 of being chosen, we call this sample a simple random sample.
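The counting argument above can be checked by enumeration. This short sketch uses the five-element population A to E from the example and Python's standard `math.comb` and `itertools.combinations`; the variable names are illustrative only.

```python
import math
from itertools import combinations

population = ["A", "B", "C", "D", "E"]
n = 3

# Number of distinct samples of size n from N elements: N! / (n! (N - n)!).
number_of_samples = math.comb(len(population), n)

# List every distinct sample of size 3.
samples = ["".join(sample) for sample in combinations(population, n)]

print(math.comb(5, 2))        # samples of size 2 from a population of 5
print(number_of_samples)      # samples of size 3 from a population of 5
print(samples)
print(1 / number_of_samples)  # probability of each sample under simple random sampling
```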
With regard to the third question of how to take a random sample in actual practice, we could, in
simple cases like the one above, write each of the $\binom{N}{n}$ possible samples on a slip of paper, put these
slips into a hat, shuffle them thoroughly, and then draw one without looking. Such a procedure is
obviously impractical, if not impossible, given the size of most populations; we mention it here
only to make the point that the selection of a random sample must depend entirely on chance.
Fortunately, we can take a random sample without actually resorting to the tedious process of listing
all possible samples. We can list instead the N individual elements of a population, and then take a
random sample by choosing the elements to be included in the sample one at a time, making sure that
in each of the successive drawings each of the remaining elements of the population has the same
chance of being selected. The selection may be accomplished by either sampling with replacement
or sampling without replacement. In sampling from a finite population, the practice usually is to
sample without replacement. Most of the theory which will be discussed is based on this method.
For example, to take a random sample of 12 of a city's 273 drugstores, we could write each store's
name (address, or some other business identification number) on a slip of paper, put the slips of
paper into a box or a bag and mix them thoroughly, and then draw (without looking) 12 of the slips
one after the other without replacement.
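Drawing slips from a box without replacement corresponds to what a pseudo-random number generator does when asked for a sample of distinct units. A minimal sketch, assuming the 273 drugstores are simply numbered 1 to 273; the function name and the seed value are hypothetical choices made for this illustration.

```python
import random


def draw_simple_random_sample(population_size, sample_size, seed=None):
    """Draw distinct unit numbers from 1..population_size without replacement."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return sorted(rng.sample(range(1, population_size + 1), sample_size))


# 12 of the city's 273 drugstores, identified by their serial numbers.
stores = draw_simple_random_sample(273, 12, seed=2006)
print(stores)
```

Recording the seed alongside the sample is one way to let the selection be re-created and audited later.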
Even this relatively easy procedure can be simplified in actual practice; usually, the simplest way to
take a random sample from a population of N units is to refer to a table of random numbers (see
Appendix III). In practice, however, the members of the population are sorted according to certain
rules and then a systematic selection of n elements is carried out. The sample thus obtained is, for all
practical purposes, a simple random sample.
4.1.1 Procedure for Selecting a Simple Random Sample (Use of Random Number Tables)
A practical procedure of selecting a random sample is to choose units one by one with the help of a
table of random numbers. Tables of random numbers are used in practical sampling to avoid the
necessity of carrying out some operation such as selecting numbered chips from an urn to designate
the units to be included in the sample. Moreover, experience has shown that it is practically
impossible to mix a set of chips thoroughly between each selection, that devices such as cards or dice
have imperfections in their manufacture, that in thinking of numbers at random people tend to favor
certain digits, etc. Consequently, such methods do not, in fact, give each member of the population
an equal chance of selection. The use of a table of random numbers, however, reduces the amount of
work involved, and also gives much greater assurance that all elements have the same probability of
selection.
Many tables of random numbers are readily available. There are several in the series of Tracts for
Computers, notably tables compiled by Tippett, and by Kendall and Smith. The RAND Corporation
has published A Million Random Digits. Sets are also available in Statistical Tables by Fisher and
Yates, and in other sources. Many of these publications describe the methods of compilation and the
uses of the tables. Some microcomputer packages, such as LOTUS spreadsheets, also have a random
number generator that can be used to generate pseudo-random numbers; these generators return
values between 0 and 1, which must be rescaled to the required range. A table of random numbers is given
in Appendix III.
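A generator that returns values between 0 and 1 can still be used to pick a unit numbered 1 to N by rescaling. The sketch below shows one common way of doing this; the function name `uniform_to_unit_number` is hypothetical, and the generator is assumed to return values in the range 0 (inclusive) to 1 (exclusive), as Python's `random.random` does.

```python
import math
import random


def uniform_to_unit_number(u, N):
    """Map a value u in [0, 1) to an integer from 1 to N, each equally likely."""
    if not 0.0 <= u < 1.0:
        raise ValueError("u must satisfy 0 <= u < 1")
    # floor(u * N) is 0..N-1; adding 1 gives 1..N. min() guards against rounding.
    return min(math.floor(u * N) + 1, N)


rng = random.Random(1)  # seed chosen arbitrarily for the illustration
print([uniform_to_unit_number(rng.random(), 273) for _ in range(5)])
```

Repeated numbers can occur with this approach; for sampling without replacement they are skipped, just as with a printed table.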
Typically, these tables show sets of random digits arranged in groups both horizontally and
vertically. To select a set of random numbers, one can start anywhere on a page. Furthermore, after
selecting the first number, one can proceed down a column, across a row, up a column, or in any
other pattern that is desired.
4.1.2 Illustration
To obtain a random number between 1 and a given number, for example between 1 and 273, proceed
as follows: Notice how many digits are in the upper limit number (for 273 there are three digits).
Use this number of columns, counting from the first (or a predetermined) column, and start at the top
(or on a predetermined line). Each line in the set of three columns has a 3-digit number. Choose the
first of these which is between 001 and the given number, inclusive. That is, between 001 and 273 in
our example. Discard numbers which are greater than 273 and discard 000. If more than one
random number is desired, continue down the three columns, choosing each 3-digit number which is
between 001 and 273 until the desired number of 3-digit random numbers is obtained. If a number is chosen two
or more times, use it only once.1
Suppose we have a part of a table of random numbers as follows:
1089  8719
9385  7902
6934  8660
0052  1007
5736  9249
1901  5988
5372  6212
Within the limits of the numbers in the examples which follow, we shall select random numbers
from the above table, using a selected number only once.
Example A: Select 3 numbers at random between 1 and 10. First choose an arbitrary column,
having decided to let 0 stand for 10. Suppose we choose the fifth column. The first number in the
column is 8; the second number is 7; the third is 8 again. Since 8 has already been selected, we skip
it and take the next number which is 1. The three numbers selected, therefore, are 8, 7, and 1.
Example B: Select 5 numbers at random between 1 and 80. Suppose we take the first two columns
as our choice of a start. First take 10; discard 93 since it is not between 01 and 80; take 69; discard
00 (which represents 100); and take 57, 19, and 53.
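Examples A and B follow the same mechanical rule: read a fixed number of digit columns down the table, discard values outside the range, and skip repeats. The sketch below encodes that rule, assuming the table extract is laid out as seven rows of eight digits (two groups of four per row); `select_from_table` and its parameters are hypothetical names.

```python
# Seven rows of the table extract, with the two 4-digit groups joined.
TABLE = ["10898719", "93857902", "69348660", "00521007",
         "57369249", "19015988", "53726212"]


def select_from_table(rows, first_col, width, upper, count, zero_as=None):
    """Read `width`-digit numbers down the rows, starting at 1-based column `first_col`.

    Values outside 1..upper and values already chosen are skipped. `zero_as`
    is the number an all-zero reading stands for (None discards it). Fewer
    than `count` numbers are returned if the rows run out.
    """
    chosen = []
    for row in rows:
        value = int(row[first_col - 1:first_col - 1 + width])
        if value == 0 and zero_as is not None:
            value = zero_as
        if 1 <= value <= upper and value not in chosen:
            chosen.append(value)
        if len(chosen) == count:
            break
    return chosen


# Example A: 3 numbers between 1 and 10 from the fifth column (0 stands for 10).
print(select_from_table(TABLE, first_col=5, width=1, upper=10, count=3, zero_as=10))
# Example B: 5 numbers between 1 and 80 from the first two columns (00 stands for 100).
print(select_from_table(TABLE, first_col=1, width=2, upper=80, count=5, zero_as=100))
```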
4.1.3 Caution in the Use of Random Table
If we use a table of random numbers frequently, we should not always use the same part. For
example, if the first random number is always taken from the same column of the same page, the
same set of numbers would be used repeatedly, and we would not get proper randomization. If
tables of random numbers are used frequently, one can continue from the last random number
selected for the previous sample, or a new starting point should be taken for each use.
4.2
NOTATION
The notation defined in this section is appropriate not only for simple random sampling, but also for
most designs. It provides a key to the system used throughout this manual. Capital letters refer to
population values and lower case (small) letters denote corresponding sample values. A bar (-) over
a letter denotes an average or mean value and (^) over a letter indicates an estimate. We shall use the
following notation:
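N = total number of units in the population
n = total number of units in the sample
$Y_i$ = value of a characteristic as measured on the i-th unit in the population; i = 1, 2, ..., N
$y_i$ = value of a characteristic as measured on the i-th sample unit; i = 1, 2, ..., n
$Y = \sum_{i=1}^{N} Y_i$ = total value of a characteristic in the population
$y = \sum_{i=1}^{n} y_i$ = total value of a characteristic in the sample (or sum of sample values)
$\bar{Y} = Y/N$ = population mean
$\bar{y} = y/n$ = sample mean
$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(Y_i - \bar{Y})^2$ = population variance (divisor N)
$S^2 = \frac{1}{N-1}\sum_{i=1}^{N}(Y_i - \bar{Y})^2$ = population variance (divisor N-1)
$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2$ = sample variance
$f = n/N$ = sampling rate or sampling fraction
$w = N/n = 1/f$ = sampling weight (expansion factor)
CV = coefficient of variation
cv = estimated coefficient of variation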
As we mentioned earlier, we shall use, unless otherwise mentioned, $S^2$ for the population variance.
The difference between $S^2$ and $\sigma^2$ disappears for large populations. In general, the population
variance, $S^2$, is not known. The sample variance, $s^2$, will be used as its estimate; this will hold
throughout the course regardless of the sampling scheme being discussed. It should be noted that in
simple random sampling, $s^2$ is an unbiased estimate of $S^2$.
4.2.1 Population Values, Their Respective Estimates, and Measures of Precision
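The sample estimate of the population total value, Y, is denoted by $\hat{Y}$ and can be written as:
$$\hat{Y} = N\bar{y} \qquad (4.1)$$
where $\bar{y}$ is the estimate of the population average, $\bar{Y}$, and is given by:
$$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i \qquad (4.2)$$
The sampling error of the estimate of $\hat{Y}$ is:
$$S_{\hat{Y}} = N\sqrt{1-f}\,\frac{S}{\sqrt{n}} \qquad (4.3)$$
and the sampling error for $\bar{y}$ is:
$$S_{\bar{y}} = \sqrt{1-f}\,\frac{S}{\sqrt{n}} \qquad (4.4)$$
The corresponding formulas for the estimated sampling error are:
$$s_{\hat{Y}} = N\sqrt{1-f}\,\frac{s}{\sqrt{n}} \qquad (4.5)$$
$$s_{\bar{y}} = \sqrt{1-f}\,\frac{s}{\sqrt{n}} \qquad (4.6)$$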
4.2.2 Illustration
Let us verify equation (4.3) with the data for the 12 individuals discussed previously (see Chapter 3).
We have already used equation (4.4) for the means of samples of sizes 1 and 2 in illustration 3.4.3,
and their standard errors for different sizes were given in Table 3.6 of Chapter 3. Using
this table, the total income of 12 individuals can be estimated. Equation (4.3) can be expressed as:
$$S_{\hat{Y}} = N\,S_{\bar{y}}$$
Using Table 3.6 of Chapter 3, the sampling error of the estimated total income for samples of size 2
is:
$$S_{\hat{Y}} = 12 \times \$1{,}015 = \$12{,}180$$
4.2.3 Relative Error
Often we wish to consider not the absolute value of the standard error, but its value in relation to the
magnitude of the statistic (mean, total, etc.) being estimated. For this purpose, one can express the
standard error as a proportion (or a percent) of the value being estimated. This form is called the
relative standard error, or coefficient of variation and is denoted by the symbol CV. The true
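population CV (for a given characteristic or variable) is defined as follows:
$$CV = \frac{S}{\bar{Y}}$$
The sample cv (for a given characteristic or variable) is given by:
$$cv = \frac{s}{\bar{y}}$$
The true CV of an estimate is denoted by:
$$CV(\hat{\theta}) = \frac{S_{\hat{\theta}}}{\theta}$$
where $\hat{\theta}$ represents any estimate (mean, total, proportion, ratio) and $\theta$ is the value being estimated.
To estimate the true $CV(\hat{\theta})$ we use the following formula which uses data from a sample:
$$cv(\hat{\theta}) = \frac{s_{\hat{\theta}}}{\hat{\theta}}$$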
One advantage of expressing error as a coefficient of variation is that it is unitless, unlike absolute
measures, like the standard deviation and the sampling error. The CV is useful when making
comparisons because no units enter into play. The population CV refers to the relative sampling
error of means of samples of 1 unit (that is, the population standard deviation expressed as a
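proportion of the population mean) and it is denoted simply by CV (not followed by a parenthesis).
Thus, for the estimate of the total, the true coefficient of variation is:
$$CV(\hat{Y}) = \frac{S_{\hat{Y}}}{Y} = \frac{N\sqrt{1-f}\,S/\sqrt{n}}{N\bar{Y}} = \sqrt{1-f}\,\frac{CV}{\sqrt{n}} \qquad (4.7)$$
Similarly, for the sample mean, the coefficient of variation is:
$$CV(\bar{y}) = \frac{S_{\bar{y}}}{\bar{Y}} = \sqrt{1-f}\,\frac{CV}{\sqrt{n}} \qquad (4.8)$$
That is, equation (4.7) is equal to equation (4.8). Therefore,
$$CV(\hat{Y}) = CV(\bar{y})$$
The corresponding formulas for the estimated (obtained from a sample) coefficients of variation are:
$$cv(\hat{Y}) = \sqrt{1-f}\,\frac{cv}{\sqrt{n}} \qquad (4.9)$$
$$cv(\bar{y}) = \sqrt{1-f}\,\frac{cv}{\sqrt{n}} \qquad (4.10)$$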
The standard error of the estimated total is N times that of the mean, while the coefficients of
variation of the two estimates are the same; this result is, upon reflection, not unexpected. An
estimated total is obtained by multiplying the sample mean (an estimate) by the number of elements
in the population (a known number); the only source of error is the sample mean. Therefore, we
should expect that, when expressed as a proportion or percentage, the error in the total would be the
same as that in the mean; however, when the error in the total is expressed in absolute terms, it
would be N times as large as the error in the mean, since N is the factor of multiplication.
The big advantage of the coefficient of variation is that it permits comparison of two distributions of
values even though they may be totally unrelated. For example, one could compare the variability of
length of mice tails to weight of elephants. This is possible because variability is expressed relative
to the mean, that is, it is the average variability per unit of mean.
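The equality of the two coefficients of variation can be confirmed numerically. The sketch below reuses the 12 incomes of Table 3.4 in Chapter 3 and a sample size of 4 chosen only for illustration; `coefficients_of_variation` is a hypothetical helper, and the formulas assumed are the simple random sampling ones with the fpc.

```python
import math

# Incomes of the 12 persons in Table 3.4 (Chapter 3).
INCOMES = [1300, 6300, 3100, 2000, 3600, 2200,
           1800, 2700, 1500, 900, 4800, 1900]


def coefficients_of_variation(values, n):
    """True sampling errors and CVs of the mean and total for samples of size n."""
    N = len(values)
    mean = sum(values) / N
    S = math.sqrt(sum((y - mean) ** 2 for y in values) / (N - 1))
    se_mean = math.sqrt((N - n) / N) * S / math.sqrt(n)
    se_total = N * se_mean            # the total is N times the mean
    return {
        "population_cv": S / mean,    # relative error for samples of 1 unit, without fpc
        "se_mean": se_mean,
        "se_total": se_total,
        "cv_mean": se_mean / mean,
        "cv_total": se_total / (N * mean),
    }


result = coefficients_of_variation(INCOMES, n=4)
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

The absolute error of the total is 12 times that of the mean, while the two relative errors coincide.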
Another way to look at the coefficient of variation is to consider it as a measure of dispersion for
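relative deviations. Recall that the variance of $Y_i$ was given by
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(Y_i - \bar{Y})^2$$
This is a measure of dispersion of the absolute deviations $(Y_i - \bar{Y})$.
If we now consider the relative deviations $(Y_i - \bar{Y})/\bar{Y}$,
square them, add them, and then average them over N, we get the following expression:
$$V^2 = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{Y_i - \bar{Y}}{\bar{Y}}\right)^2$$
which is called the relative variance of the distribution or simply the relvariance.
If we rearrange terms in the above expression, we get:
$$V^2 = \frac{1}{\bar{Y}^2}\cdot\frac{1}{N}\sum_{i=1}^{N}(Y_i - \bar{Y})^2 = \frac{\sigma^2}{\bar{Y}^2}$$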
The square root of this last expression is the population coefficient of variation mentioned before.
4.3
SAMPLING FOR PROPORTIONS
An important class of statistics for which the formulas for variance and the formulas for determining
the size of sample become particularly simple is the estimation of the proportion of units having a
certain characteristic.
4.3.1 Types of Statistics for Which Proportions are Used
Proportions arise in two ways in statistical analysis. First of all, we are frequently interested in a
statistic that is a proportion, rather than a total or an average; for example, the proportion of the
population that is unemployed, or the percentage of families with income greater than a certain
amount, or the proportion of business firms interested in purchasing a particular product. Secondly,
it may be desired to classify a population into a number of groups, and to find the percentage of the
total population in each of these groups. The groups may have a natural ordering as in distribution
by age (0 to 4 years, 5 to 9, 10 to 14, etc.) or income classes; or they may be groups having no natural
order, such as those in an industrial classification of business firms, where the groups can be
arranged in a number of ways. The analysis is the same whenever the proportion of the total in each
group is the statistic to be measured.
4.3.2 Relationship to Previous Theory
Suppose we think of the total population and the sample in the following way. Consider a particular
class of units in which we are interested, and use the following notation:
A = Total number of units in that class in the population
a = Number of units in that class in the sample
P = True proportion of units in that class in the population
p = Proportion in that class in the sample
Q = Population proportion not in that class (Q = 1 - P)
q = Proportion not in that class in the sample (q = 1 - p).
Note that $P = A/N$ and $p = a/n$.
All of the formulas discussed in previous lectures can be applied to this particular case by
considering each member of the population as having a characteristic which can have only one of
two values, either 0 or 1. If the member is in a particular class in which we are interested, the value
assigned is 1; if the member is not in the class, the value is 0. Examining the entire population, we
can see that the A members of the class each have a value of 1; the rest have a value of 0. Adding up
the values for all elements of the population, we get A. In other words, A can be considered as the
equivalent of $Y = \sum Y_i$ that we have already discussed. Similarly, $P = A/N$ can be considered in
the same way as $\bar{Y}$. We can now use the previous formulas. It turns out that they are
particularly easy to use in this case.
4.3.3 Applicable Formulas
In sampling for proportions, the following formulas are applicable (with simple random sampling):
\[ \hat{P} = p = \frac{a}{n} \quad \text{and} \quad \hat{A} = Np \tag{4.11} \]
That is, an estimate of the proportion in the population is obtained by using the sample proportion,
and an estimate of the total number of units having the characteristic is obtained by multiplying the
sample proportion by the total number of units in the population. Also
\[ S^2 = \frac{N}{N-1}\,PQ \approx PQ \tag{4.12} \]
The population variance is PQ. Note that it is the variance of the population distribution giving the
value of 1 or 0 to an element depending on whether or not it is in the class (whether it has the
attribute in question). It can still be estimated by pq, unless n is very small (for example n < 30) in
which case the formula is \( s^2 = \frac{n}{n-1}\,pq \).
The variance of the estimate of the proportion which is computed from all samples of size n is
\[ \sigma_p^2 = \frac{N-n}{N} \cdot \frac{PQ}{n} \tag{4.13} \]
The estimate of this variance which is made from a single sample of n observations is
\[ s_p^2 = \frac{N-n}{N} \cdot \frac{pq}{n} \tag{4.13a} \]
See equations (4.4) and (4.6). These are the same formulas given previously for the variance of the
sample mean, with PQ substituted for S², and for its sample estimate, with pq substituted for s².
Similarly, the formulas given in the previous section for the relative standard error (coefficient of
variation) of a mean and the sampling error of an estimated total are given by:
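\[ CV(p) = \frac{\sigma_p}{P} = \sqrt{\frac{N-n}{N} \cdot \frac{Q}{nP}} \tag{4.14} \]
and,
\[ \sigma_{\hat{A}} = N\sigma_p = N\sqrt{\frac{N-n}{N} \cdot \frac{PQ}{n}} \tag{4.15} \]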
Again the relative standard error of the total is the same as that of the mean.
The confidence interval for the proportion is derived on the same assumptions as for the quantitative
characteristics, namely that the sample proportion p is normally distributed. From (4.13a) for the
estimated variance of p, one form of the normal approximation to the confidence interval for p is:
\[ p \pm t\sqrt{\frac{N-n}{N} \cdot \frac{pq}{n}} \tag{4.16} \]
where the value t depends on the level of confidence desired (see Section 4.4 of Chapter 3).
4.3.4 Illustration
Estimate of sampling error.--Suppose that the proportion of farms that grow maize in a given area is
0.40; what would be the sampling error in estimating this proportion from a random sample of 500
farms, if the total number of farms in the area is 10,000? In this case,
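N = 10,000, n = 500, P = 0.40, Q = 0.60
We have
\[ \sigma_p^2 = \frac{N-n}{N} \cdot \frac{PQ}{n} = \frac{10{,}000 - 500}{10{,}000} \cdot \frac{(0.40)(0.60)}{500} = 0.000456 \]
Consequently,
\[ \sigma_p = \sqrt{0.000456} = 0.021 \]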
How is the figure of 0.021 to be interpreted? This means that if we establish an interval around the
true proportion of 0.40 ± 0.021 (or 0.379 to 0.421), there is a reasonably good chance (68 percent)
that a sample of 500 farms will give a proportion somewhere between 0.379 and 0.421. If we double
the interval to get a range of 0.358 to 0.442, the chance is about 95 percent that the sample estimate
will be within that range. If an interval based on three times 0.021, (or 0.063) is used, the chance is
0.997 (or nearly certain) that the sample estimate will be within that range. In normal practice, it is
customary to use a 2-S range (two standard/sampling errors) as providing sufficient confidence in
the accuracy of the estimates. If very important decisions are to be based on the results of the survey,
and we wish to be almost absolutely sure of the range within which the sample estimate will lie, we
can use a 3-S level. It is difficult to conceive of cases in which 3-S would not be sufficient.
In this example, both the proportion (0.40) and the chance of the sample estimate being within a
certain range around this proportion were known. In practice, we are usually interested in the
converse of this situation, in which we do not know the true proportion but we do have a sample
estimate of 0.40 based on a sample of 500 farms out of 10,000. We wish to establish ranges around
the sample figure which will be expected to include the true proportion. For all practical purposes, the
same statements can be made as before by substituting the term "true figure" for "sample estimate."
That is, if the sample shows that 0.40 of the farms grow maize and we establish a range of 0.40 ± 0.021
(0.379 to 0.421), the chances are about 68 percent that this range will include the true figure; the
chances are about 95 percent that the interval 0.358 to 0.442 will include the true figure; etc.
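The arithmetic of this illustration can be checked with a short script. The following is a minimal Python sketch, assuming the simple random sampling variance of a proportion with the finite population correction (N - n)/N; the function name `se_proportion` is hypothetical and is not part of the manual.

```python
import math

def se_proportion(P, n, N=None):
    """Standard error of a sample proportion under simple random sampling.

    The finite population correction (N - n) / N is applied only when N is given.
    """
    variance = P * (1 - P) / n
    if N is not None:
        variance *= (N - n) / N
    return math.sqrt(variance)

# Maize illustration: N = 10,000 farms, n = 500, P = 0.40
se = round(se_proportion(P=0.40, n=500, N=10_000), 3)
print(se)  # 0.021

# Ranges of one, two and three sampling errors around P, as in the text
for k in (1, 2, 3):
    print(k, round(0.40 - k * se, 3), round(0.40 + k * se, 3))
```

The sampling error is rounded to three decimals before the ranges are built, which mirrors how the text arrives at 0.358 to 0.442 for the two-sampling-error range.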
4.3.5 Procedure When P Refers to a Subset of a Class
Frequently, the proportion to be estimated is a percentage, not of the total population, but of a
particular class. For example, we may be interested not in unemployment expressed as a percentage
of the total population, but as a percentage of persons in the labor force; or we may need to know the
proportion of firms with more than 5 employees in a particular industry. In such cases, a very close
approximation to an exact analysis can be made by using the formulas listed above, but interpreting
the numbers N and n as applying to the class in which we are interested. That is, N would not be
considered the total population but would be the number of persons in this class (for example, the
total number of persons in the labor force) as estimated from the sample; n would be the number of
sample cases in this class; a would be the number of sample cases in the subset (for example, the
number of unemployed).
4.3.6 Tabled Value of \( \sqrt{PQ/n} \)
Table 4.1 shows the value of \( \sqrt{PQ/n} \) for specified values of P and n. As described in sections
4.3.7 and 4.3.8 below, we can use the simplified formula
\[ \sigma_p = \sqrt{\frac{PQ}{n}} \tag{4.17} \]
to compute the standard error of the proportion of units having a certain attribute, if the sample is an
unrestricted (simple) random sample and if N is so large relative to n that the factor (N-n)/N in the
formula has a value very close to 1.
Since the true proportion in the population (P) is not known, the estimate from the sample (p) may
be used in equation (4.17) to give an estimate of the sampling error of p:
\[ s_p = \sqrt{\frac{pq}{n}} \tag{4.18} \]
Most samples are stratified; that is, they are not simple random samples. We shall see later that this
has the effect of making the sampling error smaller than it would be for a simple random sample of
the same size. However, most samples used in surveys are also clustered and we shall also see that
this has the opposite effect of making the sampling error larger than it would be for a simple random
sample of the same size. When the sample is both stratified and clustered, the formulas for the
standard error become more complex.
Sometimes it is not possible to work out the exact formulas, but a rough estimate of the standard
error can be obtained by using the simple formula of equation (4.17) with an allowance for the
expected net effect of departures from randomness in the sample design. If the units of analysis are
clustered into rather small groups--for example, 5 housing units or 25 persons in a cluster, and the
persons within a cluster are rather similar, as in a cluster located in a rural area--the standard error of
a proportion as read from Table 4.1 might be multiplied by a factor such as 1.25. This factor is a
design effect. In a larger cluster, such as a city block with 40 or 50 housing units, the factor to be
applied to Table 4.1 might be 1.75, even though the persons within the cluster are less alike in an
urban area than in a rural area.
The size of the design effect to be used depends on the sample design and the nature of the
population; it can sometimes be roughly estimated by an experienced sampling statistician, using
past experience and mathematical formulas involving the “intraclass correlation.”
Table 4.1
SAMPLING ERROR OF AN ESTIMATE OF A PROPORTION IN SIMPLE RANDOM SAMPLING
(\( \sigma_p = \sqrt{PQ/n} \) for specified values of P and n)
P = Proportion of units having a characteristic (Q = 1-P has the same standard error)
n = number of sample cases

     n   .001   .002    .01    .02    .03    .04    .05    .10    .15    .20    .25    .30    .40    .50
           or     or     or     or     or     or     or     or     or     or     or     or     or
         .999   .998    .99    .98    .97    .96    .95    .90    .85    .80    .75    .70    .60
    50  .0045  .0063  .0141  .0198   .024   .028   .031   .042   .051   .057   .061   .065   .069   .071
   100  .0032  .0045  .0099  .0140   .017   .020   .022   .030   .036   .040   .043   .046   .049   .050
   200  .0022  .0032  .0071  .0099   .012   .014   .016   .021   .025   .028   .031   .033   .035   .035
   300  .0018  .0026  .0058  .0081  .0099   .012   .013   .017   .021   .023   .025   .027   .028   .029
   400  .0016  .0023  .0050  .0070  .0086   .010   .011   .015   .018   .020   .022   .023   .024   .025
   500  .0014  .0020  .0045  .0063  .0076  .0089  .0098   .013   .016   .018   .019   .021   .022   .022
   600  .0013  .0018  .0041  .0057  .0070  .0082  .0090   .012   .015   .016   .018   .019   .020   .020
   700  .0012  .0017  .0038  .0053  .0065  .0076  .0083   .011   .014   .015   .016   .017   .019   .019
   800  .0011  .0016  .0035  .0050  .0061  .0071  .0078   .011   .013   .014   .015   .016   .017   .018
  1000  .0010  .0014  .0032  .0044  .0054  .0063  .0070  .0095   .011   .013   .014   .015   .015   .016
  1200  .0009  .0013  .0029  .0040  .0049  .0058  .0064  .0087   .010   .012   .013   .013   .014   .014
  1500  .0008  .0012  .0026  .0036  .0044  .0052  .0057  .0077  .0093   .010   .011   .012   .013   .013
  1700  .0008  .0011  .0024  .0034  .0042  .0049  .0053  .0073  .0087  .0097   .011   .011   .012   .012
  2000  .0007  .0010  .0022  .0031  .0038  .0045  .0049  .0067  .0081  .0090  .0097   .010   .011   .011
  2500  .0006  .0009  .0020  .0028  .0034  .0040  .0044  .0060  .0072  .0080  .0087  .0092  .0098  .0100
  3000  .0006  .0008  .0018  .0026  .0031  .0036  .0040  .0055  .0066  .0073  .0079  .0084  .0090  .0092
  3500  .0005  .0008  .0017  .0024  .0029  .0034  .0037  .0051  .0061  .0068  .0073  .0078  .0083  .0084
  4000  .0005  .0007  .0016  .0022  .0027  .0032  .0035  .0047  .0057  .0063  .0068  .0073  .0077  .0079
  4500  .0005  .0006  .0015  .0021  .0025  .0030  .0033  .0045  .0054  .0060  .0065  .0069  .0073  .0074
  5000  .0004  .0006  .0014  .0020  .0024  .0028  .0031  .0042  .0051  .0057  .0061  .0065  .0069  .0071

In practice the sample value p would be used, inasmuch as the population value P would not be
known.
For values of n greater than 5,000, when n is multiplied by 100, the standard error is divided by
10.
4.3.7 The Design Effect (DEFF)
The design effect or DEFF is the ratio of the variance of the estimate obtained from the more
complex sample (described later in this text) to the variance of the estimate obtained from a
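simple random sample of the same size. For instance, if \( \sigma^2_{\text{complex}} \) is the variance of an
estimate obtained from a complex sample, and \( \sigma^2_{\text{srs}} \) is the variance of the same
estimate based on simple random sampling, then
\[ DEFF = \frac{\sigma^2_{\text{complex}}}{\sigma^2_{\text{srs}}} \]
and
\[ \sigma^2_{\text{complex}} = DEFF \times \sigma^2_{\text{srs}} \]
where \( \sigma^2_{\text{complex}} \) = variance obtained from the more complex design.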
This approach is commonly used by practical samplers. In many situations where we cannot
estimate the variance of the estimate directly, we may be able to guess fairly well both the
element variance S² and DEFF from experience with similar past data. This comprehensive
factor attempts to summarize the effects of various complexities in the sample design, especially
those of clustering.
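As a rough illustration of how a guessed DEFF is applied, the sketch below scales a simple random sampling standard error. Because DEFF is defined above as a ratio of variances, the standard error is multiplied by the square root of DEFF. The function names and the design effect of 1.5 are hypothetical values chosen only for the example.

```python
import math

def se_complex(se_srs, deff):
    """Approximate standard error under a complex design.

    deff is a ratio of variances, so the standard error scales by sqrt(deff).
    """
    return se_srs * math.sqrt(deff)

def effective_sample_size(n, deff):
    """Size of the simple random sample that would give the same variance."""
    return n / deff

# Assumed inputs: SRS standard error 0.022 (Table 4.1, P = .50, n = 500)
# and a design effect of 1.5 guessed from past experience.
print(round(se_complex(0.022, 1.5), 4))
print(round(effective_sample_size(500, 1.5)))
```

A multiplier quoted for the standard error itself, such as the factors 1.25 and 1.75 mentioned for Table 4.1, corresponds to the square root of DEFF under this definition.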
4.3.8 Finite Correction Factor (or Finite Population Correction Factor)
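The exact formula for the relative variance (square of the coefficient of variation) of a mean for a
simple random sample,
\[ CV^2(\bar{y}) = \frac{N-n}{N} \cdot \frac{S^2}{n\bar{Y}^2} \quad \text{or} \quad CV^2(\bar{y}) = \frac{N-n}{N} \cdot \frac{(CV)^2}{n}, \]
where \( \bar{Y} \) is the population mean and \( CV = S/\bar{Y} \), can be divided into two parts:
\[ \frac{N-n}{N} \quad \text{and} \quad \frac{S^2}{n\bar{Y}^2} \ \text{or} \ \frac{(CV)^2}{n}. \]
The only way the size of the total population comes into the formula is in the expression \( (N-n)/N \).
This is usually called the finite population correction factor (fpc). If the population were infinite
this factor would be 1 and the formulas would be much simpler:
\[ CV^2(\bar{y}) = \frac{S^2}{n\bar{Y}^2} \quad \text{or} \quad CV^2(\bar{y}) = \frac{(CV)^2}{n} \]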
The value of the finite population correction factor is approximately equal to \( 1 - f \), where
\( f = n/N \) is the sampling
rate. If the sampling rate is small, say less than 0.05, the effect of the finite population correction
factor is very small and, for all practical purposes, the finite population correction factor can be
ignored.
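The size of the correction can be seen numerically. The sketch below is illustrative only: it assumes the fpc in the form (N - n)/N, and the population size of 10,000 and the three sample sizes are arbitrary values.

```python
import math

def fpc(n, N):
    """Finite population correction factor (N - n) / N, equal to 1 - f with f = n / N."""
    return (N - n) / N

N = 10_000
for n in (100, 500, 2_000):
    f = n / N
    # The standard error is multiplied by the square root of the fpc.
    print(f"f = {f:.2f}  fpc = {fpc(n, N):.2f}  SE multiplier = {math.sqrt(fpc(n, N)):.3f}")
```

At a sampling rate of 0.05 the standard error shrinks by only about 2.5 percent, which is why the factor is usually dropped at or below that rate.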
4.3.9 Simplification for Large Populations
With large populations and small sampling rates, the fpc can be ignored and the formulas become
simpler.
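Simplified Formulae

Quantity                                          True Value                     Estimate
Variance of the mean                              \( S^2/n \)                    \( s^2/n \)
Variance of a proportion                          \( PQ/n \)                     \( pq/n \)
Coefficient of variation of the mean              \( S/(\bar{Y}\sqrt{n}) \)      \( s/(\bar{y}\sqrt{n}) \)
Coefficient of variation of a proportion          \( \sqrt{Q/(nP)} \)            \( \sqrt{q/(np)} \)
Variance of a total                               \( N^2 S^2/n \)                \( N^2 s^2/n \)
Variance of the total number of units
  having an attribute                             \( N^2 PQ/n \)                 \( N^2 pq/n \)
Coefficient of variation of a total               \( S/(\bar{Y}\sqrt{n}) \)      \( s/(\bar{y}\sqrt{n}) \)
Coefficient of variation of total number
  of units having an attribute                    \( \sqrt{Q/(nP)} \)            \( \sqrt{q/(np)} \)
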
CHAPTER 5
ESTIMATION OF SAMPLE SIZE
_______________________________________________________________________
5.1
SPECIFIC CONSIDERATIONS FOR DETERMINING THE
SAMPLE SIZE
One of the first questions which a statistician is called upon to answer in planning a sample survey
refers to the size of the sample required for estimating a population parameter with a specified
precision. Making a decision about the size of the sample for the survey is important. Too large a
sample implies a waste of resources, and too small a sample diminishes the utility of the results.
When considering sample size determination, there are three very important concerns: ACCURACY,
PRACTICALITY, and EFFICIENCY.
5.1.1.
Accuracy
Accuracy can be defined as an inverse measure of the total error. Total error is the sum of sampling
error (SE) and nonsampling error (NSE). Sampling error arises because only a part of the population
is observed, and not all of it. The terms PRECISION and RELIABILITY are associated with
sampling error. Estimator A is more precise or more reliable than estimator B if the sampling error
of A is smaller than the sampling error of B. Nonsampling errors are usually biases which are very
often due to poor quality control of the survey operations (poor questionnaire design; interviewers
that are not well trained; response errors; etc.)
5.1.2.
Practicality
To obtain an accurate estimate, both sampling and nonsampling errors must be reduced. However,
accuracy may come into conflict with practicality because:
a. to reduce sampling errors and increase precision, the sample size must be large.
b. too large a sample can impose an excessive burden on the limited resources available
(and resources are usually very limited) and increase the likelihood of nonsampling
errors.
5.1.3.
Efficiency
A further concern is that a given sample size can produce different levels of precision depending on
which sampling techniques are chosen. This concept is known as the statistical efficiency of the
design. The most efficient design is the one that gives the most precision for the same sample size.
Therefore, expert sample design is needed in the determination of the optimal sample size.
Example 5.1
A population consists of N = 5000 persons. A simple random sample without replacement (SRSWOR) of size n = 50 included 10 persons of Chinese descent.
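A 95% confidence interval for P, the proportion of persons of Chinese descent in the population, is:
\[ p \pm 1.96\sqrt{\frac{N-n}{N} \cdot \frac{pq}{n-1}} = 0.20 \pm 1.96\sqrt{\frac{5{,}000 - 50}{5{,}000} \cdot \frac{(0.20)(0.80)}{49}} = 0.20 \pm 0.111 \]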
The conclusion is that between 8.9% and 31.1% of the population is of Chinese descent. This
interval is too wide to be useful. There are two ways in which a narrower interval could be obtained:
- by lowering the confidence level, or
- by increasing the sample size.
There is a point at which lowering the confidence level is not attractive. We shall consider the
problem of determining the sample size necessary to produce a fixed level of precision.
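The interval quoted in Example 5.1 can be reproduced with a few lines of Python. This is a sketch under stated assumptions: it uses the small-sample variance estimate pq/(n - 1) with the finite population correction and a normal multiplier of 1.96, a combination that matches the 8.9% to 31.1% stated above. The function name `proportion_ci` is hypothetical.

```python
import math

def proportion_ci(a, n, N, z=1.96):
    """Normal-approximation confidence interval for a population proportion.

    a: sample units in the class, n: sample size, N: population size.
    """
    p = a / n
    q = 1 - p
    # Estimated variance of p with the fpc and the n - 1 divisor
    se = math.sqrt((N - n) / N * p * q / (n - 1))
    return p - z * se, p + z * se

low, high = proportion_ci(a=10, n=50, N=5000)
print(f"{low:.3f} to {high:.3f}")  # 0.089 to 0.311
```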
The following eight steps are taken into account when determining the sample size. We will study
each one in detail.
(1) Degree of precision desired
(2) Formula to connect n with desired precision
(3) Advance estimates of variability in population
(4) Cost and operational constraints
(5) Expected sample loss due to nonresponse
(6) Number of different characteristics for which specified precision is required
(7) Population subdivisions for which separate estimates of a given precision are
required. These are also called domains of estimation.
(8) Expected gain or loss in efficiency
5.2
Degree of Precision
The precision of an estimate refers to the amount of variable error, mainly sampling error, contained
in an estimate. To lower the sampling error, that is, to increase the precision, we want n to be
sufficiently large. Therefore, we decide on a target value for the precision of the estimate. The
degree of precision desired can be stated in terms of:
(1) The absolute error (E) for the estimate:
\[ P(|\hat{\theta} - \theta| \le E) = 1 - \alpha \]
where \( \hat{\theta} \) is an estimate of the parameter \( \theta \) and \( (1 - \alpha) \) is the degree of confidence
desired.
The absolute error E is measured in the same unit used to measure the variable. For
example, E = 5 hectares or E = $10,000 or E = 25 persons.
(2) The relative error (RE) for the estimate:
\[ RE = E/\theta \]
This is E expressed as a proportion (or percentage) of the true value of the parameter
being estimated. For example, if E = 5 hectares and the true value of the parameter is
100, then RE = 5/100 = 0.05 or 5%.
(3) The target coefficient of variation (cv) for the estimate (v0)
We set the cv (also known as the relative standard error) for the estimate equal to a
target value v0. For example, we can have \( v_0 = 0.05 \), that is, a cv no larger than 5 percent.
Depending on which of the three ways we use to specify the precision, the formula for
n will be different. The values of E, RE and α are usually decided by the user of the
data in conjunction with the statistician.
5.3
Formula that Connects n (sample size) with the Desired Degree of Precision
The following terms are used in the formulas outlined below.
S² = the population variance
n = the desired sample size
CV = the population coefficient of variation, \( CV = S/\bar{Y} \), where \( \bar{Y} \) is the population mean
N = number of units in the population
k = 1 for 68% confidence, 2 for 95% confidence, 3 for 99.7% confidence (the exact normal deviate, for example 1.96 for 95% confidence, could be used instead)
cv = the coefficient of variation of the estimator, \( cv(\hat{\theta}) = \sigma_{\hat{\theta}}/\theta \)
v0 = specified target value for the estimate's cv
E = absolute error
RE = relative error
Note:
The level of confidence states the probability that the n determined will provide the degree
of precision specified. For example, a 95% level of confidence means that, except for a
small chance (5%), we can be 95% certain that the precision specified will be reached with
the calculated n. This is equivalent to saying that the acceptable risk is 5% that the true θ
will lie outside of the range specified in the confidence interval.
5.3.1.
Sample size needed to estimate a mean with absolute error E
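The sampling error of a mean using simple random sampling is given by:
\[ \sigma_{\bar{y}} = \sqrt{\frac{N-n}{N} \cdot \frac{S^2}{n}} \]
Now, \( E = k\sigma_{\bar{y}} \), where k is a multiple of the sampling error, selected to achieve the specified
degree of confidence. Therefore, if we substitute \( \sigma_{\bar{y}} \) for (E/k), we get:
\[ \frac{E}{k} = \sqrt{\frac{N-n}{N} \cdot \frac{S^2}{n}} \tag{5.1} \]
If we solve for n in (5.1) above, we get:
\[ n = \frac{N k^2 S^2}{N E^2 + k^2 S^2} \tag{5.2} \]
If the population size is large and n ≤ 0.05N, the finite population correction factor in equation (5.1)
can be ignored because its effect would be minimal. In this case, we have:
\[ n = \frac{k^2 S^2}{E^2} \tag{5.3} \]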
Example 5.2
Consider a population consisting of 1,000 farms for which the population variance of the number of
cattle per farm is 250 (N = 1,000 and S² = 250). Let us estimate the average number of cattle per
farm from a sample; we wish to have reasonable confidence that the estimate will be close to the true
value. Suppose the sample estimate is to be in error by no more than 1 (one head of cattle) from the
true average, and we require an assurance of 95 chances out of 100 that the error will be no larger
than 1. In this case,
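E = 1, E² = 1
N = 1,000
S² = 250
k = 2 (since 2 gives us almost a 95% confidence level); k² = 4.
Applying equation (5.2), we see that n must be equal to or greater than
\[ n = \frac{(1{,}000)(4)(250)}{(1{,}000)(1) + (4)(250)} = \frac{1{,}000{,}000}{2{,}000} = 500 \]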
If in the same situation we are satisfied with an error of not more than 3, with a confidence level of
95 percent, the only change in the formula would be in the values of E and E², as follows:
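E = 3 and E² = 9.
Then we would have,
\[ n = \frac{(1{,}000)(4)(250)}{(1{,}000)(9) + (4)(250)} = \frac{1{,}000{,}000}{10{,}000} = 100 \]
and a sample of 100 cases would be sufficient.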
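The two calculations of Example 5.2 can be repeated for other inputs with a small helper. The sketch below assumes the sample size formula n = N k² S² / (N E² + k² S²), which reproduces the 100 cases stated above, and the simplified form k² S² / E² when the fpc is ignored. The function name `sample_size_mean` is hypothetical.

```python
import math

def sample_size_mean(S2, E, k=2, N=None):
    """Sample size needed to estimate a mean to within E.

    Without N the fpc is ignored (k^2 S^2 / E^2); with N the fpc is included.
    """
    if N is None:
        return k**2 * S2 / E**2
    return N * k**2 * S2 / (N * E**2 + k**2 * S2)

print(sample_size_mean(S2=250, E=1, k=2, N=1000))  # 500.0
print(sample_size_mean(S2=250, E=3, k=2, N=1000))  # 100.0

# Ignoring the fpc would overstate the requirement in this small population
print(sample_size_mean(S2=250, E=1, k=2))  # 1000.0
```

In practice the result is rounded up to the next whole unit, for example with `math.ceil`.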
Example 5.3
We wish to estimate the average age of 2,000 seniors on a particular college campus. How large a
SRS must be taken if we wish to estimate the age within 2 years from the true average, with 95%
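confidence? Assume S² = 30.
E = 2 and k = 2, so that E² = 4 and k² = 4. Applying equation (5.2),
\[ n = \frac{(2{,}000)(4)(30)}{(2{,}000)(4) + (4)(30)} = \frac{240{,}000}{8{,}120} \approx 29.6 \]
so a sample of 30 seniors would be sufficient.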
5.3.2.
Sample size needed to estimate a proportion with absolute error E
The sample size n to estimate a population proportion P is obtained from equation (5.2); in this
equation, \( S^2 = \frac{N}{N-1}\,PQ \), but we'll use the approximation S² = PQ (i.e., we'll assume N is big
enough so that N/(N-1) is very close to 1):
\[ n = \frac{N k^2 PQ}{N E^2 + k^2 PQ} \tag{5.4} \]
And for a large population size (n ≤ 0.05N), we have from equation (5.3),
\[ n = \frac{k^2 PQ}{E^2} \tag{5.5} \]
Example 5.4
Refer to Example 5.1. Suppose we would like to estimate P, the proportion of persons of
Chinese descent, to within ± 3%, with 95% confidence. What sample size do we have to choose to
achieve this target? Assume P to be no larger than 1/2.
Now, let's assume that we know that P ≤ 0.25. What is the required sample size?
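One way to work Example 5.4 is sketched below. It is not the manual's own solution: it assumes N = 5,000 from Example 5.1, k = 2 for 95% confidence as in section 5.3, S² taken as PQ, and the fpc form n = N k² PQ / (N E² + k² PQ). The function name `sample_size_proportion` is hypothetical.

```python
import math

def sample_size_proportion(P, E, k=2, N=None):
    """Sample size needed to estimate a proportion to within E, taking S^2 = PQ."""
    pq = P * (1 - P)
    if N is None:
        return k**2 * pq / E**2
    return N * k**2 * pq / (N * E**2 + k**2 * pq)

# P = 0.50 is the conservative choice; P = 0.25 uses the extra knowledge P <= 0.25
for P in (0.50, 0.25):
    n = sample_size_proportion(P=P, E=0.03, k=2, N=5000)
    print(P, math.ceil(n))
```

Under these assumptions the requirement is 910 persons when P may be as large as 1/2 and 715 persons when P is known not to exceed 0.25.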
5.3.3.
Sample size needed to estimate a total with absolute error E
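Using equation (4.3), and letting \( E = k\sigma_{\hat{Y}} \), where \( \sigma_{\hat{Y}} = N\sigma_{\bar{y}} \) is the sampling error of the estimated total,
we get the following formula for n:
\[ n = \frac{N^2 k^2 S^2}{E^2 + N k^2 S^2} \tag{5.6} \]
If we ignore the fpc, we have:
\[ n = \frac{N^2 k^2 S^2}{E^2} \tag{5.7} \]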
5.3.4.
Sample size needed to estimate the number of units that possess a certain attribute
with absolute error E
To obtain the n necessary to estimate A, the number of units that possess a certain characteristic,
simply substitute PQ in place of S² in equations (5.6) and (5.7).
5.3.5.
Sample size formulas when the error is expressed in relative terms (RE)
We can obtain formulas for estimates when the desired error is expressed in relative terms instead of
absolute terms. For relative errors (RE), if (RE) is a proportion of the estimates, substitute (RE/k) for
the coefficient of variation of the estimated mean (or of the estimated total) in equation (4.7) or (4.8).
We will denote by cv the estimated coefficient of
variation. The true population coefficient of variation is denoted by CV. We then have:
\[ n = \frac{N k^2 (CV)^2}{N (RE)^2 + k^2 (CV)^2} \tag{5.8} \]
Note:
This applies to both means and totals.
If we ignore the fpc, then equation (5.8) becomes
\[ n = \frac{k^2 (CV)^2}{(RE)^2} \tag{5.9} \]
NOTE 1:
In actual practice, we usually do not know S² or (CV)². Indeed we do not even know s²
in advance of the survey. Instead, we use rough estimates of S² or (CV)², obtained by
the methods discussed in section 8 of chapter 6.
NOTE 2:
For the mean and the total, it is better to express the variance in relative rather than
absolute terms, for two reasons:
(1) Most importantly, because a population’s relative variance is more stable than its
absolute variance. A guess or estimate of the population coefficient of variation
CV (from past data or from similar populations) is likely to be closer to the true
value than a guess or estimate of the variance.
(2) The formula for n is the same for estimators of means or totals when it is
expressed in terms of the coefficient of variation.
NOTE 3:
To estimate the proportion P, it is preferable to use the absolute error previously
discussed because the proportion is itself a relative quantity, so that taking the
percentage of a percentage can become confusing.
To obtain the formula for the sample size required to estimate a population proportion when the error
is expressed as relative error (RE), use equation (5.8),
where we replace (CV)² by Q/P. That is, we get
\[ n = \frac{N k^2 Q}{N (RE)^2 P + k^2 Q} \tag{5.10} \]
If we ignore the fpc, equation (5.10) becomes:
\[ n = \frac{k^2 Q}{(RE)^2 P} \tag{5.11} \]
Example 5.5
We would like to carry out a survey to estimate the total area in hectares of the farms in a population.
The estimate should be within 10% of the true value. How many farms should be surveyed? (In a
pilot survey, we estimated the population coefficient of variation, CV, of the variable farm size to be
1.2). Use 95% confidence.
5.3.6.
Sample size formulas when the error is expressed in terms of the coefficient of
variation
Equation (5.8) can be expressed in terms of the coefficient of variation. If \( CV = S/\bar{Y} \) is the
population coefficient of variation and \( v_0 = RE/k \) is a specified target value for an estimate's
coefficient of variation, then (5.8) becomes,
\[ n = \frac{N (CV)^2}{N v_0^2 + (CV)^2} \tag{5.12} \]
If we ignore the fpc, equation (5.9) gives:
\[ n = \frac{(CV)^2}{v_0^2} \tag{5.13} \]
Equations (5.12) and (5.13) apply to both means and totals.
Let’s consider Example 5.5 and use coefficients of variation to solve the problem.
Example 5.6
Suppose that a survey was carried out to estimate the total area in hectares of the farms in a
population. The estimate should be within 10 percent of the true value, with 95 percent confidence.
How many farms should be surveyed? [In a pilot survey, we estimated the population coefficient of
variation CV of the variable "farm size" to be 1.2].
In this case,
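k = 2, CV = 1.2, RE = 0.10, so that \( v_0 = RE/k = 0.05 \).
Substituting in equation (5.13), we have,
\[ n = \frac{(CV)^2}{v_0^2} = \frac{(1.2)^2}{(0.05)^2} = \frac{1.44}{0.0025} = 576 \]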
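The same calculation can be scripted. This sketch assumes the target coefficient of variation is v0 = RE/k and that the required sample size is (CV/v0)² when the fpc is ignored, or N·CV² / (N·v0² + CV²) when the population size N is supplied. The function name `sample_size_relative` is hypothetical.

```python
def sample_size_relative(CV, RE, k=2, N=None):
    """Sample size for a mean or total when precision is given as a relative error."""
    v0 = RE / k  # target coefficient of variation of the estimate
    if N is None:
        return CV**2 / v0**2
    return N * CV**2 / (N * v0**2 + CV**2)

# Farm-area example: CV = 1.2, within 10 percent, 95 percent confidence (k = 2)
n = sample_size_relative(CV=1.2, RE=0.10, k=2)
print(round(n))  # 576
```

Because only CV, RE and k enter the simplified formula, the answer is the same for the mean farm size and for the total area.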
Example 5.7
The results from a pilot test are used to estimate \( \bar{Y} \) and S for the variable ‘income’ in a population of
5,000 households.
\( \bar{y} \) = $14,852 per household
s = $12,300
A full scale survey is planned. What should be the sample size for this survey if we want to estimate
the mean income per household with a cv no larger than 5%?
The population coefficient of variation (CV) is estimated by:
\[ cv = \frac{s}{\bar{y}} = \frac{12{,}300}{14{,}852} \approx 0.83 \]
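A possible completion of Example 5.7 is sketched below. It is an assumption-laden illustration rather than the manual's own solution: the pilot values are read as a mean of $14,852 and a standard deviation of $12,300, the target is v0 = 0.05, and the sample size is taken as N·cv² / (N·v0² + cv²) with the fpc, or cv² / v0² without it.

```python
import math

y_bar = 14_852   # pilot estimate of mean income per household
s = 12_300       # pilot estimate of the standard deviation
N = 5_000        # households in the population
v0 = 0.05        # target coefficient of variation

cv = s / y_bar
n_with_fpc = N * cv**2 / (N * v0**2 + cv**2)
n_no_fpc = cv**2 / v0**2

print(round(cv, 3))            # 0.828
print(math.ceil(n_with_fpc))   # 261
print(math.ceil(n_no_fpc))     # 275
```

The two answers differ little because the required sample is only about 5 percent of the population.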
5.4.
Advance Estimates of Population Variances
In the preceding section, we noted that most of the sample size formulas are written in terms of the
population variance. In practice this is unknown and it must be estimated or guessed. There are five
ways of estimating population variances for sample size determination.
Method 1: Select the sample in two steps, the first being a simple random sample of size n1 (the
first sample) from which estimates s1² and p1 of S² and P, respectively, are obtained.
Then use this information to determine the required n (the final sample size).
Method 2: Use the results of a pilot survey. This is one of the more commonly used methods.
Method 3: Use the results of previous samples of the same or similar population.
Method 4: Guess about the structure of the population and use some mathematical results.
Method 5: (Only for qualitative characteristics.) If the statistic to be measured is a proportion,
then make a fairly good guess of P (the proportion in the population).
Method 1 carries out the survey in two steps. In the first step, only a subsample (a random part of
the total sample) is enumerated. An analysis of this part permits one to estimate the variance and to
make revisions in the total size of the sample, if necessary. In the second step, the remainder of the
sample is enumerated in accordance with these changes, if any. This method gives the most reliable
estimates of S² or P, but it is not often used, since it slows up the completion of the survey.
Method 2 is one of the more commonly used methods. It serves many purposes, especially if the
feasibility of the main survey is in doubt. If the pilot survey is itself a simple random sample, the
preceding methods apply. But often the pilot work is restricted to a part of the population that is
convenient to handle or that will reveal the magnitude of certain problems.
Method 3 is also a very commonly used method. This method points to the value of making
available, or at least keeping accessible, any data on standard errors obtained in previous surveys.
Unfortunately, the cost of computing standard errors in complex surveys is high, and frequently only
those standard errors needed to give a rough idea of the precision of the principal estimates are
computed and recorded. If suitable past data are found, the value of S² may require adjustment for
time changes. Experience indicates that the variance of an item tends to change much more slowly
over time than the mean value of the item itself. Even if the mean value changes, the relative error
may be quite stable.
Method 4 uses some mathematical results. Deming (1960) showed that some simple mathematical
distributions may be used to estimate S² from a knowledge of the range (h) and a general idea of the
shape of the distribution of the characteristic of interest:
S² = 0.0289 * h² for a normal distribution
S² = 0.083 * h² for a rectangular (uniform) distribution
S² = 0.056 * h² for a distribution shaped like a right triangle
S² = 0.042 * h² for a distribution shaped like an isosceles triangle.

Approximate Values of S
General Shape of Distribution            Approximate S
Normal                                   .17 * h
Isosceles Triangle                       .20 * h
Right Triangle (Skewed Distribution)     .24 * h
Uniform Distribution                     .29 * h

These relations do not help much if the range, h, is large or poorly known. However, if h is large,
good sampling practice is to stratify the population so that within any stratum the range is
significantly reduced. Usually the shape also becomes simpler (closer to rectangular) within a
stratum. Consequently, these relations are effective in predicting S², hence S, within individual
strata.
Example 5.8
The universities in the State of Maryland were classified according to the number of enrolled
students into four size classes. The standard deviation within each class is shown below:
Size Class (i)    Enrollment Level, Xi    Si
1                 < 1,000                 236
2                 1,000-3,000             625
3                 3,000-10,000            2,008
4                 > 10,000                10,023
If you knew the class boundaries but not the values of Si, how well could you guess the values by
using the Deming method? (No university has fewer than 200 enrolled students and the largest has
about 50,000).
We do not know the number of universities in each size class; therefore, we cannot obtain a
frequency distribution that would show us the general shape of the distribution. A conservative
estimate would be that the distribution is uniform. In this case, Si would be given by 0.29 * hi, where
hi is the range of each class.
S1 = 0.29 (1,000 - 200) = 232
S2 = 0.29 (3,000 - 1,000) = 580
S3 = 0.29 (10,000 - 3,000) = 2,030
S4 = 0.29 (50,000 - 10,000) = 11,600
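The comparison asked for in Example 5.8 can be tabulated with a short script. The sketch below uses the multipliers from the table of approximate values of S and the class boundaries stated in the example. The dictionary and function names are hypothetical.

```python
# Approximate S as a multiple of the range h, from the table of approximate values
SD_FACTOR = {
    "normal": 0.17,
    "isosceles_triangle": 0.20,
    "right_triangle": 0.24,
    "uniform": 0.29,
}

def guess_sd(low, high, shape="uniform"):
    """Guess a standard deviation from the range (high - low) and a distribution shape."""
    return SD_FACTOR[shape] * (high - low)

classes = [(200, 1_000), (1_000, 3_000), (3_000, 10_000), (10_000, 50_000)]
actual = [236, 625, 2_008, 10_023]

for (low, high), s_true in zip(classes, actual):
    guess = guess_sd(low, high)
    # Relative difference between the guess and the known within-class value
    print(round(guess), s_true, f"{(guess - s_true) / s_true:+.1%}")
```

Printing the relative differences makes it easy to see in which size classes the uniform assumption is close and where it is furthest off.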
Method 5: if the statistic to be measured is a proportion--for example, the proportion of farms
growing corn--the population variance is approximately PQ. It is only necessary to be able to make a
fairly good guess at P in order to estimate S². As long as the guess is reasonably close, we will get a
good estimate of S². For example, suppose the true value of P is 0.4; then the value of S² = PQ
would be 0.4 x 0.6 = 0.24. Suppose we made a rather poor guess of P, say 0.3. We would then
estimate the value of the variance as 0.3 x 0.7 = 0.21, which differs from the true value by only about
12.5 percent. Note that we can also estimate S² by setting S² = PQ = (1/2)(1/2) because the formula for
n is maximized when P = Q = 1/2. This latter is called a "conservative estimate," because we can
never do worse than that.
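The insensitivity of PQ to a poor guess of P can be seen by tabulating it. This is a small illustrative sketch; the set of guesses is arbitrary and the function name `variance_pq` is hypothetical.

```python
def variance_pq(P):
    """Population variance PQ of a 0/1 characteristic with proportion P."""
    return P * (1 - P)

true_value = variance_pq(0.4)
for guess in (0.2, 0.3, 0.4, 0.5):
    v = variance_pq(guess)
    # Relative difference from the variance at the true P = 0.4
    print(guess, round(v, 2), f"{(v - true_value) / true_value:+.1%}")
```

The value at P = 0.5 is the largest that PQ can take, which is why it serves as the conservative choice.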
5.5.
Cost and Operational Constraints
Let us recall that the total error is composed of both bias and variance. Large sample sizes reduce the
variance (i.e., yield high precision) but tend to increase cost and operational difficulties, which
translate into larger nonsampling errors.
To reduce the incidence of nonsampling errors, a survey needs:
(1) good quality control
(2) sufficient resources.
However, in a real survey setting, there exist constraints with respect to:
(a) budget
(b) field conditions
(c) field and office personnel
(d) time
(e) equipment and materials, etc.
Hence, in addition to precision, we also need to consider the maximum sample size that can be
handled by the available resources. It may be necessary to limit the sample size in order to stay
within budget and operational constraints.
If the maximum practical sample size is much smaller than that required to achieve the specified
precision, calculations can be made to estimate the level of precision that could be expected from the
actual sample size. If this level is not acceptable, greater resources have to be allocated to
accommodate a larger sample size.
To compromise between precision and practicality, we may take a sample size that is somewhere
between the constraint-based and the precision-based sizes.
5.6.
Expected Sample Loss Due to Nonresponse
If past experience indicates that a certain level of nonresponse can be present, we may want to inflate
the calculated sample size to compensate. This is because our calculations were based on a 100
percent response. If we do not obtain all the interviews, then the estimates will be based on a
number smaller than the calculated n and will, therefore, have a greater variance than expected.
Inflating Procedure
We compute the inflated sample size n' from the following relationship:
\[ n' = \frac{n}{r} \]
where r is an estimate of the expected response rate; it can be obtained from previous rounds of
the same survey, previous experience with similar surveys, a pilot (pre-test), etc.
For example, we calculate n to be 1,000 units. Based on the results of a pilot survey, we anticipate
the response rate to be 70 percent.
Our inflated n will be: n' = 1000/.70 = 1,429
If our assumption was correct, we should get back 70% of 1,429 = 1,000.
Therefore, our estimates will be based on the same number of units as expected and the target
precision will be attained.
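The inflating procedure amounts to one division and a rounding up. A minimal sketch, assuming the relationship n' = n/r used in the example above; the function name `inflate_for_nonresponse` is hypothetical.

```python
import math

def inflate_for_nonresponse(n, r):
    """Inflate a calculated sample size n for an expected response rate r (0 < r <= 1)."""
    if not 0 < r <= 1:
        raise ValueError("response rate must be greater than 0 and at most 1")
    return math.ceil(n / r)

print(inflate_for_nonresponse(1000, 0.70))  # 1429
```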
Important Note
Inflating the sample size when there is nonresponse only helps compensate for the resulting loss in
precision. It does nothing for diminishing the resulting nonresponse bias.
5.7.
Number of Different Characteristics Requiring a Specified Precision
In most surveys information is collected from a sampling unit for more than one characteristic. One
method of determining sample size is to specify margins of error for the characteristics that are
regarded as most vital to the survey. An estimation of the sample size needed is first made
separately for each of these important characteristics.
When the estimations of n have been completed for each of the most important characteristics, it is
time to take stock of the situation. It may happen that the n's required are all reasonably close. If the
largest of the n's falls within the limits of the budget, this sample size is selected. More commonly,
there is sufficient variation among the n's so that we are reluctant to choose the largest, either for
budgetary considerations or because this will give an overall level of precision substantially higher
than originally contemplated for the other characteristics. In this event the desired level of precision
may be relaxed for some of the characteristics in order to permit the use of a smaller value of n.
In some cases the n's required for different characteristics are so different that some of them must be
dropped from the survey; with the resources available the precision expected for these characteristics
is totally inadequate. The difficulty may not be merely one of sample size. Some characteristics call
for a different type of sampling scheme than others. With populations that are sampled repeatedly, it
is useful to gather information about those characteristics that can be combined economically in a
general survey and those that need special methods. As an example, a classification of
characteristics into four types, suggested by experience in regional agricultural surveys, is shown in
Table 5.1. In this classification, a general survey means one in which the units are fairly evenly
distributed over some region as, for example, by a simple random sample.
Table 5.1.
AN EXAMPLE OF DIFFERENT TYPES OF ITEMS IN REGIONAL SURVEYS
Type | Characteristics of item | Type of sampling needed
1 | Widespread throughout the region, occurring with reasonable frequency in all parts. | A general survey with low sampling fraction.
2 | Widespread throughout the region but with low frequency. | A general survey, but with a higher sampling fraction.
3 | Occurring with reasonable frequency in most parts of the region, but with more sporadic distribution, being absent in some parts and highly concentrated in others. | For best results, a stratified sample with different intensities in different parts of the region (Chapter 5). Can sometimes be included in a general survey with supplementary sampling.
4 | Distribution very sporadic or concentrated in a small part of the region. | Not suitable for a general survey. Requires a sample geared to its distribution.
Example
The following coefficients of variation per unit were obtained in a farm survey in Iowa, the unit
being an area 1 square mile.
Item | Estimated cv
Acres in farms (Y1) | 0.38
Acres in corn (Y2) | 0.39
Acres in oats (Y3) | 0.44
Number of family workers (Y4) | 1.00
Number of hired workers (Y5) | 1.10
Number of unemployed (Y6) | 3.17
A survey is planned to estimate acreage characteristics with a cv of 2½% and numbers of workers
(excluding unemployed) with a cv of 5%. With simple random sampling, how many units are
needed? How well would this sample be expected to estimate the number of unemployed? The
results are displayed in the following table:
Item | Estimated cv | Target cv for estimate | n | Expected cv with n = 484
Y1 | 0.38 | 0.025 | 232 | 0.017
Y2 | 0.39 | 0.025 | 244 | 0.018
Y3 | 0.44 | 0.025 | 310 | 0.020
Y4 | 1.00 | 0.050 | 400 | 0.046
Y5 | 1.10 | 0.050 | 484 | 0.050
Y6 | 3.17 | -- | -- | 0.144
Comments
1. Assuming cost and workload constraints permit it, a sample of 484 segments should be
taken (the largest calculated size). This sample size should achieve the desired precision
(or better) for the estimates of Y1 through Y5. As noted in the last column, the cv of each
estimate is expected to be as small as desired or smaller if n = 484 is used.
2. As for the estimate of Y6, a cv of approximately 14% can be expected if a sample size of
484 is used. Although the precision will be lower for this estimate than for the
others, this is not critical because sponsors and data users did not require higher precision.
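The sample sizes in this example can be checked with a short Python sketch. It assumes the simple random sampling relationship n = (unit cv / target cv)², with the finite population correction ignored and the result rounded up; the variable and function names are illustrative only.

```python
import math

# Per-unit coefficients of variation from the Iowa farm survey example.
unit_cv = {"Y1": 0.38, "Y2": 0.39, "Y3": 0.44, "Y4": 1.00, "Y5": 1.10, "Y6": 3.17}
# Target cv of the estimate; no target was set for Y6.
target_cv = {"Y1": 0.025, "Y2": 0.025, "Y3": 0.025, "Y4": 0.050, "Y5": 0.050}

def required_n(cv_unit, cv_target):
    """n = (cv_unit / cv_target)^2, rounded up (fpc ignored).
    The small tolerance avoids rounding up on floating-point noise."""
    return math.ceil((cv_unit / cv_target) ** 2 - 1e-9)

n_needed = {item: required_n(unit_cv[item], target_cv[item]) for item in target_cv}
n_final = max(n_needed.values())  # the largest n satisfies every target

# Precision to expect for the item that had no target (Y6).
expected_cv_y6 = unit_cv["Y6"] / math.sqrt(n_final)

print(n_needed)
print(n_final, round(expected_cv_y6, 3))
```

Taking the maximum of the individual sample sizes is the "largest n" rule described in comment 1; when that maximum exceeds the budget, the targets have to be relaxed instead.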
5.8.
Population Subdivisions Requiring Separate Estimates of a Specified Precision
If there are subpopulations or domains of estimation for which separate estimates of a given
precision are required, we must resort to a different sampling strategy, such as the use of stratified
sampling with different sampling rates by stratum.
Under stratified sampling, each stratum or domain is considered a "population" in its own right. We
can then apply the same principles to calculate separate sample sizes within each stratum to meet the
precision requirements for the domain estimates. Often the same precision is required in each
domain. If the variability and the cost within the domain are similar from domain to domain, then
the sample sizes will be about the same in all domains.
The overall sample size would then be the sum of the stratum sample sizes. The overall estimate for
the whole population would have a higher precision than the stratum-level estimates.
For example, if the unemployment rate is to be measured at the national level with x% target cv, the
national sample size computed would be n, say 5,000 households. On the other hand, if the
unemployment rate is needed for each of 5 regions of the country, all with the same precision, the
total (national) sample size required would be around 5n or 25,000 households. The national
estimate would have a precision much higher than originally planned.
5.9.
Expected Gain or Loss in Efficiency
The formulas discussed so far are all based on simple random sampling (SRS). Let us denote by n_srs
the sample sizes obtained from those formulas.
However, as will be seen later on, simple random sampling is rarely used in complex surveys. The
efficiency of the design actually used is measured by comparing the variance of the estimator θ̂
obtained with the complex design and the variance of the same estimator with SRS.
- If the complex design is more efficient, that is, inherently tends to produce a lower variance
than SRS, then our precision is likely to be better than expected with n_srs.
- If, on the other hand, the complex design is less efficient than SRS, that is, has an
inherent tendency to produce a higher variance for θ̂ than SRS, then our expected precision
level may not be met with the calculated n_srs. In this case, it would be desirable to inflate
n_srs beforehand.
As we study different sampling schemes, we will know which are more efficient than SRS and which
are less. Here are some examples:
Usually more efficient than SRS:
- stratified sampling, implicit stratification in systematic selection, use of more efficient
estimators (e.g., ratio estimators of total)
Generally less efficient than SRS:
- cluster sampling (used for convenience and cost effectiveness)
The efficiency of a particular sample design is measured by the design effect (see Chapter 4).
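One common way to carry out the inflation mentioned above is to multiply the SRS-based sample size by an assumed design effect. The sketch below illustrates that convention; the function name and the design effect values (1.5 and 0.9) are hypothetical and are not taken from this manual.

```python
import math

def adjust_for_design(n_srs, deff):
    """Scale an SRS-based sample size by an assumed design effect (deff).
    deff > 1 means the design is less efficient than SRS; deff < 1, more efficient."""
    if deff <= 0:
        raise ValueError("deff must be positive")
    return math.ceil(n_srs * deff)

# Hypothetical values: a cluster design with deff = 1.5, a stratified design with deff = 0.9.
print(adjust_for_design(1000, 1.5))  # 1500
print(adjust_for_design(1000, 0.9))  # 900
```

In practice the design effect is itself an estimate, usually borrowed from an earlier survey with a similar design, so the adjusted size should be read as a planning figure.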
5.10.
Relationship Between Size of Sample and Size of Population
We return to certain implications of the basic formula from which all the above formulas are derived.
That basic formula was given in equation (4.4) in Chapter 4 as:
$$ \mathrm{Var}(\bar{y}) = \frac{S^2}{n} \cdot \frac{N-n}{N} \qquad (5.14) $$
Notice that the sampling variance of the mean is equal to the variance of individual observations (S²)
in the population multiplied by the factor (N-n)/(nN). What happens when the sample increases
from its smallest possible size (n = 1) to its largest possible size (n = N)? When n = 1,
$$ \mathrm{Var}(\bar{y}) = S^2 \cdot \frac{N-1}{N} $$
This states the familiar fact that the variance of the means of samples of one unit is the same as the
variance of individual observations in the population. At the other extreme, when n = N,
$$ \mathrm{Var}(\bar{y}) = \frac{S^2}{N} \cdot \frac{N-N}{N} = 0 $$
That is, if the sample includes the entire population, the mean is estimated without sampling error.
For sample sizes between these extremes, how does the sampling fraction (sampling rate) n/N affect
the standard error? The answer, sometimes surprising to students, is that for populations that are
large relative to the sample size, the absolute size of the sample (n) and not the sampling fraction n/N
determines the precision of the estimated mean. This follows from the fact that when N is large
relative to n, the factor (N-n)/N ≈ 1. (The symbol ≈ stands for "is approximately equal to".) Then
$$ \mathrm{Var}(\bar{y}) \approx \frac{S^2}{n} $$
Thus, it is clear that the error depends on S² and n, and not on the sampling fraction n/N.
On the other hand, for small populations the sampling fraction does have an effect. For example,
suppose two populations have the same mean and the same variance, S² = 100, while
N1 = 40 and N2 = 400. If we take the same size of sample from each, say n1 = n2 = 20, the standard
errors are related (in an inverse way) to the sampling fractions. Equation (5.14) then gives:
Population | N | n | n/N | Standard error of the mean
1st population | 40 | 20 | .50 | 1.6
2nd population | 400 | 20 | .05 | 2.2
The number of sample units needed to achieve the same precision would be greater for the second
(larger) population. However, the number of sample units needed to achieve a given reliability does
not increase indefinitely as the number of elements in the population increases. In other words, we
reach a point in which adding an extra sampling unit does not produce a sizable reduction in
variance.
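The effect of population size on the standard error can be explored numerically. The sketch below assumes the variance formula of equation (5.14), Var = (S²/n)(N-n)/N, and uses the values from the two-population example (S² = 100, n = 20); the larger population sizes are added only for illustration.

```python
import math

def se_mean(S2, n, N):
    """Standard error of the sample mean under SRS without replacement."""
    return math.sqrt((S2 / n) * (N - n) / N)

# Same S^2 and n, increasingly large populations.
for N in (40, 400, 4000, 400000):
    print(N, round(se_mean(100, 20, N), 2))

# Limiting value when the finite population correction is ignored.
print(round(math.sqrt(100 / 20), 2))
```

The standard error rises from 1.6 to 2.2 between the first two populations and then barely changes, which is the point made in the paragraph above.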
Example
Table 5.2 below shows the size of sample necessary to give an estimate of the population mean
within a 5 percent error (E = 0.05) of the estimate (with confidence coefficient k = 2) for populations
ranging in size from 50 to 10,000,000 elements and with (CV)² = .10 in each case. These results
were obtained using equation (4.8) of Chapter 4. Equation 4.8 is given by:
$$ n = \frac{N k^2 (CV)^2}{N E^2 + k^2 (CV)^2} $$
Table 5.2
NUMBER OF ELEMENTS NECESSARY FOR FIXED PRECISION: (CV)² = .10
(E = .05 and k = 2)
Number of elements in the population (N) | Number of elements required in sample (n) | n/N
50* | 38 | .76
100 | 62 | .62
1,000 | 138 | .14
10,000 | 158 | .016
100,000 | 160 | .0016
1,000,000 | 160 | .00016
10,000,000 | 160 | .000016
* Use equation (4.8) when N is smaller than 50.
As an example, let’s calculate the first value of n in Table 5.2. Since N = 50 and is very small for a
population value, we have to use the formula for n that contains the finite population correction
factor (N-n)/N. The series of steps leading to the number 38 in Table 5.2 is shown below.
The objective is to leave n on one side of the equation in terms of the other components.
Now, we know that (CV)² = 0.10. This is the population coefficient of variation and is given to us
as a known value. However, we do not know the value of
but we can obtain it by using
the following:
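Consequently, we have the following value for n when N = 50:
$$ n = \frac{N k^2 (CV)^2}{N E^2 + k^2 (CV)^2} = \frac{(50)(4)(0.10)}{(50)(0.0025) + (4)(0.10)} = \frac{20}{0.525} \approx 38 $$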
Table 5.2 shows that for small populations, the sample size needed for a given accuracy does
increase as the population increases, but the sample size approaches a fixed number as the population
gets very large. The largest size of sample we would ever need for this accuracy (with CV² = .10) is
160 elements, and this is approximately the number we would need whether there are 10,000 or
10,000,000 elements in the population. Furthermore, if we had used a sample of 160 for a
population even as small as 1,000, the sample would be somewhat larger than necessary; but the
excess would not have been very serious.
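The leveling off of the required sample size can be reproduced with a short sketch. It assumes the usual finite-population form n = n0 / (1 + n0/N), where n0 = k²(CV)²/E² is the size needed for a very large population; the function name is illustrative. The unrounded values agree with Table 5.2 to within about one unit, and the small differences presumably reflect rounding in the published table.

```python
def sample_size(N, cv2=0.10, E=0.05, k=2):
    """Unrounded sample size for relative error E with confidence coefficient k."""
    n0 = k ** 2 * cv2 / E ** 2      # size needed when N is effectively infinite
    return n0 / (1 + n0 / N)        # finite population adjustment

for N in (50, 100, 1_000, 10_000, 100_000, 1_000_000, 10_000_000):
    n = sample_size(N)
    print(f"{N:>10,}  n = {n:6.1f}  n/N = {n / N:.6f}")
```

With these inputs n0 is 160, the limit that the table approaches as N grows.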
Chapter 5
Simple Random Sampling Problems
1.
State park officials were interested in the proportion of campers who consider the campsite
spacing adequate in a particular campground. They decided to take a simple random sample of
size n = 30 from the first N = 300 camping parties which visit the campground. Let yi = 0 if the
head of the i-th party sampled does not think the spacing is adequate and yi = 1 if he does (i = 1,
2, . . . , 30). Use the data below to estimate P, the proportion of campers who consider the
campsite spacing adequate. Find the sampling error of the estimate and its coefficient of
variation.
Camper Sampled | Response yi
1 | 1
2 | 0
3 | 1
. . . | . . .
29 | 1
30 | 1
Answers: p = 0.8333; samp. error(proportion) = 0.065653216; cv(proportion) = 7.88%
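The answers to Exercise 1 can be checked with the sketch below. It assumes the estimated variance (1 - n/N) p(1 - p)/(n - 1) for a proportion under simple random sampling, and it assumes that 25 of the 30 sampled parties answered 1, which is what p = 0.8333 implies; the function name is hypothetical.

```python
import math

def srs_proportion(a, n, N):
    """Return (p, sampling error, cv) for a successes in an SRS of n from N."""
    p = a / n
    variance = (1 - n / N) * p * (1 - p) / (n - 1)
    se = math.sqrt(variance)
    return p, se, se / p

# Assumed count of favorable responses: 25 out of 30 (implied by p = 0.8333).
p, se, cv = srs_proportion(25, 30, 300)
print(round(p, 4), round(se, 6), f"{cv:.2%}")
```

The same function can be used to check the other proportion exercises in this set.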
2.
Use the data in Exercise 1 to determine the sample size required to estimate P with a bound on
the error of estimation of magnitude E = 0.05.
Answer: n = 125
3.
A simple random sample of 100 water meters within a community is monitored to estimate the
average daily water consumption per household over a specified dry spell. The sample mean
and sample variance are found to be ȳ = 12.5 and s² = 1252, respectively. If we assume that
there are N = 10,000 households within the community, estimate μ, the true average daily
consumption, and find the sampling error of the mean and its coefficient of variation.
Answers: Mean = 12.5; samp. error (mean) = 3.538361; cv(mean) = 28.31%
4.
Using Exercise 3, estimate the total number of gallons of water, T, used daily during the dry
spell. Find the sampling error of the total and its coefficient of variation.
Answers: Total = 125,000; se(total) = 35,383.61; cv(total) = 28.31%
5.
Resource managers of forest game lands are concerned about the size of the deer and rabbit
populations during the winter months in a particular forest. As an estimate of population size,
they propose using the average number of pellet groups for rabbits and deer per 30 foot square
plots. Using an aerial photograph, the forest was divided into N = 10,000 thirty foot square
grids. A simple random sample of n = 500 plots was taken, and the number of pellet groups
was observed for rabbits and for deer. The results of this study are summarized below:
Statistic | Deer | Rabbits
Sample mean | 2.30 | 4.52
Sample variance | 0.65 | 0.97
Estimate μ1 and μ2, the average number of pellet groups for deer and rabbits, respectively, per
30-foot-square plot. Find the sampling error and the coefficient of variation of each mean.
Answers: Mean(deer) = 2.30; se(deer) = 0.035142567; cv(deer) = 1.53%
Mean(rabbits) = 4.52; se(rabbits) = 0.042930176; cv(rabbits) = 0.95%
6.
A simple random sample of n = 40 college students was interviewed to determine the
proportion of students in favor of converting from the semester to the quarter system. If 25 of
the students answered affirmatively, estimate the proportion of students on campus in favor of
the change. (Assume N = 2000.) Find the sampling error of the proportion and its coefficient
of variation.
Answers: p = 0.625; se(proportion) = 0.076743; cv(proportion) = 12.28%
7.
A dentist was interested in the effectiveness of a new toothpaste. A group of N = 1,000 school
children participated in a study. Prestudy records showed there was an average of 2.2 cavities
every six months for the group. After three months on the study, the dentist sampled n = 10
children to determine how they were progressing on the new toothpaste. Using the data below,
estimate the mean number of cavities for the entire group and find the sampling error and the
coefficient of variation of the mean.
Child | Number of Cavities in the Three-Month Period
1 | 0
2 | 4
3 | 2
4 | 3
5 | 2
6 | 0
7 | 3
8 | 4
9 | 1
10 | 1
Answers: Mean = 2.0; se(mean) = 0.469039; cv(mean) = 23.45%
8.
The Fish and Game department of a particular state was concerned about the direction of its
future hunting programs. In order to provide for a greater potential for future hunting, the
department wanted to determine the proportion of hunters seeking any type of game bird. A
simple random sample of n = 1000 of the N = 99,000 licensed hunters was obtained. If 430
indicated they hunted game birds, estimate P, the proportion of licensed hunters seeking game
birds. Find the sampling error and the coefficient of variation of the proportion.
9.
Using the data in Exercise 8, determine the sample size the department must obtain to estimate
the proportion of game-bird hunters, given an error of estimation E = 0.02.
Answer: n = 2,300
10.
A company auditor was interested in estimating the total number of travel vouchers that were
incorrectly filed. In a simple random sample of n = 50 vouchers taken from a group of N =
250, 20 were filed incorrectly. Estimate the total number of vouchers from the N = 250 that
have been filed incorrectly, and find its sampling error and coefficient of variation. (Hint: If P
is the population proportion of incorrect vouchers, then NP is the total number of incorrect
vouchers. An estimator of NP is Np, which has an estimated variance given by
$$ v(Np) = N^2 \left(1 - \frac{n}{N}\right) \frac{p(1-p)}{n-1} $$
)
11.
A psychologist wishes to estimate the average reaction time to a stimulus among 200 patients
in a hospital specializing in nervous disorders. A simple random sample of n = 20 patients was
selected and their reaction times were measured with the following results:
Estimate the population mean, μ, and find the sampling error and the coefficient of variation of
the mean.
12.
In Exercise 11, how large a sample should be taken in order to estimate μ with an error of
estimation equal to one second? Use 1.0 second as an approximation of the population
standard deviation.
Answer: n = 4.
13.
The manager of a machine shop wishes to estimate the average time that it takes for an operator
to complete a simple task. The shop has 98 operators. Eight operators are selected at random
and timed. The following are the observed results:
Time in minutes: 4.2, 5.1, 7.9, 3.8, 5.3, 4.6, 5.1, 4.1
Estimate the average time it takes an operator to complete a simple task and find the sampling
error and the coefficient of variation of the average time.
14.
A sociological study conducted in a small town calls for the estimation of the proportion of
households which contain at least one member over 65 years of age. The city has 621
households according to the most recent city directory. A simple random sample of n = 60
households was selected from the directory. At the completion of the field work, out of the 60
households sampled, 11 contained at least one member over 65 years of age. Estimate the true
population proportion, P, and find the sampling error and the coefficient of variation of the
proportion.
15.
In Exercise 14, how large a sample should be taken in order to estimate P with an error of
estimation of 0.08? Assume the true proportion P is approximately 0.2.
Answer: n = 84
16.
An investigator is interested in estimating the total number of “count trees” (trees larger than a
specified size) on a plantation of N = 1500 acres. This information is used to determine the
total volume of lumber for trees on the plantation. A simple random sample of n = 100 one-acre
plots was selected, and each plot was examined for the number of count trees. If the
sample average for the n = 100 one-acre plots was ȳ = 25.2 with a sample variance of s² = 136,
estimate the total number of count trees on the plantation and find the sampling error and the
coefficient of variation of the estimated total.
Answers: Total = 37,800; se(total) = 1689.97; cv(total) = 4.47%
17.
Using the results of the survey conducted in Exercise 16, determine the sample size required to
estimate T, the total number of trees on the plantation, with an error of estimation E = 1500.
Answer: n = 388.
18.
You want to design a household survey to estimate average annual income per household. The
number of households is 2,000,000. On the basis of the data from a previous census, the
population variance of annual income per household is estimated to be 1,000,000 (that is, S =
1000).
a. What sample size is necessary to estimate the average annual income with a 95 percent
confidence that the result is accurate to plus or minus $100?
Answer: n = 385.
b. What size sample is necessary to estimate average annual income within plus or minus $50,
also at the 95 percent confidence level?
Answer: n = 1,537.
19.
Refer to the universe of eight farms listed below with known value of land and buildings as
follows:
Farm 1 - $2026
Farm 2 - $6854
Farm 3 - $1532
Farm 4 - $2180
Farm 5 - $5408
Farm 6 - $9284
Farm 7 - $1438
Farm 8 - $8836
a.
List the 28 simple random samples of two farms each, compute the mean for each sample and
verify that the average mean of all 28 means is $4,694.75.
b.
Compute the standard deviation of the 28 means and check that the standard deviation is
$2,037 (or 2,036.776).
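Exercise 19 can be verified by brute force. The sketch below lists all 28 samples of two farms, computes each sample mean, and then takes the mean and the standard deviation of those 28 means. The standard deviation is computed with divisor 28, since the 28 means are treated as the complete sampling distribution.

```python
from itertools import combinations
from statistics import mean, pstdev

# Value of land and buildings for the universe of eight farms.
farm_values = [2026, 6854, 1532, 2180, 5408, 9284, 1438, 8836]

# Every possible simple random sample of two farms, and its mean.
sample_means = [sum(pair) / 2 for pair in combinations(farm_values, 2)]

print(len(sample_means))                # 28
print(mean(sample_means))               # 4694.75
print(round(pstdev(sample_means), 3))   # 2036.776
```

The mean of the sample means equals the population mean (unbiasedness), and the standard deviation matches the square root of (S²/n)(N-n)/N with N = 8 and n = 2.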
Chapter 6.
PRACTICAL CONSIDERATIONS
IN SELECTING A SAMPLE
_____________________________________________________________________________________________
6.1 SAMPLING FRAME
In order to select a sample, it is necessary to have a sampling frame; that is, a list of all elements (or
the equivalent, such as a list of blocks, housing units, etc.) so that the probability of selection of each
element can be known in advance. The frame need not be literally a list. In sampling from cards,
questionnaires, etc., the documents themselves can be considered as the frame. But it is necessary to
know that the file is complete. For example, in sampling from a file of records, one should make
sure that no records are out of the file--in use or waiting to be refiled--since such records would not
have any chance of selection. Again, in using a population register maintained by local authorities,
one should make certain the list is current. For example, the list might not contain all families of
newly married couples. Since new families and those that move around are likely to differ in their
characteristics from older and more settled families, a biased sample would result.
In using local registers or lists, it may be useful to conduct an actual check of the completeness, on a
more or less informal basis. This can be done by going out to the area to be sampled, selecting a few
families (or farms or business firms) scattered around the area, and checking to see if they are on the
list. If possible, it is better to select families of the type likely to be missing from the list, since this
would provide a better test. A rough idea of the adequacy of the list can be obtained in this manner.
6.2 PROBABILITY OF SELECTION OF UNITS
Special difficulties arise when some units have more than one chance of selection--for example,
when sampling from a file in which some individuals are included more than once; when selecting a
sample of families from a sample of individual persons; etc. To illustrate, one might select a sample
of school children and use it to select families. It is clear that if one draws a sample of families by
first selecting a sample of persons and including the families to which these persons belong, the
families will have unequal probabilities of selection, since the larger the family the greater the
chance of selection. Similarly, selecting a sample of a business firm's customers by using a record
file containing a separate sheet (or card) for each purchase will give customers making more than
one purchase a greater chance of selection.
To avoid the biases which result from giving some of the units a greater chance of selection than
others, it is desirable to restrict the sampling procedure so that each unit has only one chance of
selection. For example, when selecting a sample of families, we could make a rule to include the
family only if the head of the family is the person selected. Since each family has only one head,
each family would have the same chance of selection.
The specified person on whom the selection of the family depends need not be the head; he/she could
just as well be the oldest person, the youngest child, etc. The only requirement is that each family
have one and only one such member. Similarly, in sampling customers, we could restrict the sample
by using only the cards with the earliest date for each customer, etc.
While the technique described in the preceding paragraph is generally recommended, whether the
sample is drawn from a file, a set of questionnaires, or is selected in the field, there are other
techniques that might be used. They will provide unbiased estimates of the universe, although they
do not strictly satisfy the conditions of simple random sampling. Some of these techniques are:
(1) After selecting the initial sample by including all families for which one (or more) person has
been selected, we group the sample by size of family. It is clear that families with 2 members
have twice the chance of selection of those with 1; families with 3 members have three times
the chance of selection; etc. Therefore, instead of interviewing all families in the sample, we
interview only ½ of the two-member families, 1/3 of the three-member families, etc.
(2) Proceed as above, but interview all families instead of ½, 1/3, etc. However, in tabulating the
results, tabulate each size class separately, and multiply the results of the two-person families
by ½, the three-person families by 1/3, etc., before adding the results together.
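Technique (2) amounts to weighting each family's result by the reciprocal of its size. A minimal sketch follows; the family sizes and values are made-up illustration data, and the function names are hypothetical.

```python
# Hypothetical sample of families reached through a sample of persons:
# (number of members, value reported by the family).
sampled_families = [(1, 120.0), (2, 200.0), (2, 180.0), (3, 330.0), (4, 400.0)]

def unweighted_total(families):
    """Plain sum; over-represents large families."""
    return sum(value for _, value in families)

def weighted_total(families):
    """Multiply each family's result by 1/size before adding, as in technique (2)."""
    return sum(value / size for size, value in families)

print(unweighted_total(sampled_families))
print(weighted_total(sampled_families))
```

The weighted sum still has to be multiplied by the reciprocal of the person-level sampling rate to estimate a population total; that step is omitted here.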
6.3 FRAMES INCLUDING OUT-OF-SCOPE UNITS
Sometimes the only available frame is a list which includes some units which are outside the scope
of the universe defined for the survey. For example, suppose a special analysis is desired of the
census characteristics of males. The only source for sampling is a card file containing cards for all
persons both male and female, and it is not feasible to remove all the cards for females. The file can
still be used as a frame even though cards for both males and females will be designated by the
random selection process. The proper procedure in such a case is to take only the cards for the
males selected, and disregard those for the females.
Do not substitute. A procedure that is sometimes erroneously used (and may cause serious bias) is to
substitute the next "male" card in the file for each "female" card drawn in the sample. There are two
things wrong with this method:
(1) It results in a higher sampling rate than that specified. Also, the sampling rate actually
obtained cannot be calculated unless the total number of males is known. This makes it
impossible to use the reciprocal of the sampling rate, N/n, as a multiplier to produce estimates
of totals from the sample.
(2) A more serious objection to this substitution lies in the biases it may introduce in the selection
process. Suppose we have a list of all housing units and we wish to select a sample of
occupied dwellings only. If we use a procedure that substitutes the next occupied unit for each
vacant housing unit that falls into the sample, occupied units that are neighbors of vacant ones
will have two chances of selection--the chance that their own listing entry is selected and the
chance that the listing of the neighboring vacant dwelling is selected. If vacant units are more
likely to be found in poor and undesirable neighborhoods, this would mean that occupied
housing units in such areas would be over-represented in the sample.
6.4 SYSTEMATIC SAMPLING
The work necessary to draw a simple random sample can be quite burdensome when the number of
units to be selected is large. For example, to get a 5 percent sample of 20,000 elements, it would be
necessary to select 1,000 random numbers from a table of random numbers and then to select the
designated units from the population. In practice, most statisticians prefer a different method. A
sample of this size is usually drawn by taking a random number between 1 and 20, then taking every
20th element thereafter. Thus, if the random number is 3, the elements taken will be 3, 23, 43, 63,
and so on up to 19,983. The reciprocal of the sampling rate (20 in this case) is called the sampling
interval. The method of estimating the mean, total, or a proportion is the same as for simple random
sampling.
This type of sampling is called systematic sampling. It is not the same as simple random sampling,
but it is an acceptable sampling method because the chance of selecting any one element is known
and we can calculate the sampling errors.
If the elements in the population are arranged in a nearly random order (that is, with very
little correlation between successive elements), the results of systematic sampling will be in
close agreement with those of simple random sampling. Experience shows that, generally, the
two methods will give results of roughly the same accuracy. The systematic sample will often
have a somewhat smaller sampling error, since it will make certain the sample will be spread
throughout the population. We may make use of the formulas for simple random sampling to
evaluate the reliability of estimates from a systematic sample; the result will usually somewhat
overstate the standard error for systematic sampling. In other words, we will underestimate, slightly,
the reliability of the estimates. There are ways of calculating the standard errors of systematic
samples more precisely; however, they are not covered in these chapters.
6.4.1 General Procedure for Selecting a Sample
The systematic sample selection procedure consists of the following steps:
1. Assign serial numbers from 1 through N to the population units.
2. Calculate SI = N/n, the sampling interval.
- For exactness, carry as many decimals as possible.
- You may round if you are doing this without a calculator, but you would be sacrificing
exactness for convenience.
3. Select a random number (RN) between 0 and SI from a table of random numbers. This
is called the random start (RS).
- In the permitted range, exclude zero but include the sampling interval.
- Use as many digits as SI has, including decimals.
- If you are searching through a RN table, pretend the decimal point is not there.
- If you are using a calculator which only provides random numbers between zero and
one, multiply this random number by the value of SI in order to get a random number
between zero and SI. Remember to keep the decimals; do not round yet.
4. Begin the series of cumulated numbers with RS. Add SI to this first number to determine
the second. Then, add SI to the second number to get the third, and so on.
- Do not round decimals during the addition process.
5. Stop cumulating when the last cumulated number exceeds N (discard this last number).
- This should occur when you have cumulated n numbers.
- If you rounded SI before adding, you may not have exactly n.
6. Now go back and round all the cumulated numbers up to the next integer.
7. On the list of population units, circle the serial numbers that correspond to these integers.
- These are the selected units.
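The selection steps can be sketched in Python as follows. This is an illustration under stated assumptions, not a production routine: the function names are hypothetical, the series is generated as RS + k(SI) for k = 0, ..., n-1 rather than by stopping when N is exceeded, and the clamp to the range 1..N is an added safeguard against floating-point drift.

```python
import math
import random

def select_units(N, n, rs):
    """Steps 4 to 7: cumulate RS + k*SI and round each cumulated number up."""
    si = N / n  # sampling interval, decimals kept
    units = []
    for k in range(n):
        cumulated = rs + k * si
        # Round up to the next integer; the clamp only matters at the extremes.
        units.append(max(1, min(N, math.ceil(cumulated))))
    return units

def systematic_sample(N, n, rng=random):
    """Step 3 plus selection: draw a random start in (0, SI], then select n units."""
    si = N / n
    rs = si - rng.random() * si  # rng.random() is in [0, 1), so rs lies in (0, SI]
    return select_units(N, n, rs)

# A village of 285 housing units, a sample of 12, and a random start of 19.79.
print(select_units(285, 12, 19.79))
# [20, 44, 68, 92, 115, 139, 163, 187, 210, 234, 258, 282]
```

Calling systematic_sample(285, 12) draws a fresh random start each time; passing a seeded random.Random instance as rng makes the selection reproducible.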
Example 6.1
Suppose that a village contains 285 housing units (HUs) and we wish to select a systematic sample
of 12 HUs for a survey. Assume the list is randomly ordered.
We want to determine the HUs that will be in the sample.
1. SI = N/n = 285/12 = 23.75
2. RN between 0001 and 2375 is 1979
3. RS = 19.79
4. Series of cumulated numbers:
Sample Unit | Selection Number of Selected Unit | Actual Unit Selected (Serial Number after rounding up)
1 | 19.79 | 20
2 | 19.79+23.75=43.54 | 44
3 | 43.54+23.75=67.29 | 68
4 | 67.29+23.75=91.04 | 92
5 | 114.79 | 115
6 | 138.54 | 139
7 | 162.29 | 163
8 | 186.04 | 187
9 | 209.79 | 210
10 | 233.54 | 234
11 | 257.29 | 258
12 | 281.04 | 282
13 | 304.79 | (Discard)
Remarks: Let's see what might have happened if we had not carried the decimals.
1. SI = N/n = 285/12 = 23.75 rounded up to 24.
2. Suppose RN between 01 and 24 is actually 24.
3. RS = 24
4. Results:
(1) 24
(2) 48
(3) 72
. . .
(11) 264
(12) 288 (discard).
We exhausted the population before reaching our 12 units. This would not have happened if we had
kept the decimals (had not rounded up at the beginning), even if our RN was equal to the SI.
6.4.1.2
Useful Variation for Use with Computer Software Packages
We accomplish the same results by truncating instead of rounding up. Refer to Section 6.4.1 above.
- In step 3 of Section 6.4.1, while choosing RN, include zero but exclude SI;
- Add 1 to RN to define RS;
- Then, in step 6, truncate (that is, retain only the integer portion of the number), instead of
rounding up.
This alternative is convenient when using computer software packages because their rounding
functions usually round to the nearest integer instead of always rounding up. So, it is better to use
the integer functions, which always truncate.
Let's look at an example in order to clarify the concepts. Refer to the previous example.
1. SI = N/n = 285/12 = 23.75
2. RN between 0000 and 2374 is 1979.
3. RS = 19.79 + 1 = 20.79
4. Series of cumulated numbers:

Sample Unit    RN + k (SI)                  Actual Unit Selected
                                            (Serial Number after truncating)
     1         20.79                         20
     2         20.79 + 23.75 = 44.54         44
     3         44.54 + 23.75 = 68.29         68
     4         68.29 + 23.75 = 92.04         92
     5         115.79                        115
     6         139.54                        139
     7         163.29                        163
     8         187.04                        187
     9         210.79                        210
    10         234.54                        234
    11         258.29                        258
    12         282.04                        282
    13         305.79                        (Discard)
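The selection procedure, in both its round-up and its truncation form, can be sketched in Python as follows. This is only an illustrative sketch and not part of the original manual: the function name `systematic_sample` and its parameters are assumptions, and the random number is passed in by the caller (in hundredths, as in the examples) rather than drawn, so that the two tables above can be reproduced. Exact fractions are used so that binary floating-point error cannot affect the rounding.

```python
import math
from fractions import Fraction


def systematic_sample(N, n, rn_hundredths, truncate=False):
    """Select a systematic sample using a fractional sampling interval.

    N, n           population size and desired sample size
    rn_hundredths  random number in hundredths (1979 means 19.79);
                   hypothetical parameter supplied by the caller
    truncate       False: round the cumulated numbers up
                   True:  add 1 to RN and truncate instead
    """
    si = Fraction(N, n)                   # sampling interval, decimals kept
    rs = Fraction(rn_hundredths, 100)     # random start
    if truncate:
        rs += 1
    selected = []
    value = rs
    while True:
        unit = math.floor(value) if truncate else math.ceil(value)
        if unit > N:                      # past the end of the list: discard
            break
        selected.append(unit)
        value += si                       # cumulate the sampling interval
    return selected


# Both variants reproduce the 12 serial numbers of Example 6.1.
print(systematic_sample(285, 12, 1979))
print(systematic_sample(285, 12, 1979, truncate=True))
```

Both calls return the serial numbers 20, 44, 68, ..., 282 shown in the tables.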
6.4.2 Caution in the use of systematic sampling
There is one situation in which systematic sampling will give very poor reliability: the case
in which the arrangement of the elements in the population follows a very regular (periodic) pattern
and the sampling interval of the systematic sample coincides with that pattern. For example, suppose all
families in a certain population consisted of exactly four persons--the head, his wife, and two
children. The population has been listed in the order just given and we wish to draw a 25 percent
systematic sample from this list to obtain some special information. Since the sampling procedure is
to take every fourth person starting at random, four possible samples could be obtained:
(1) Random start is 1--the sample will consist entirely of heads of families.
(2) Random start is 2--the sample will consist entirely of wives of heads.
(3) Random start is 3--the sample will consist entirely of children.
(4) Random start is 4--the sample will again consist entirely of children.
In a case such as this, results from sample to sample would have nearly the maximum possible
variation, and it would be likely that estimates based on any one of the samples would be quite far
from the true values for the population. However, even in this extreme case, the estimates would be
unbiased; that is, the averages of the estimates for all possible samples would be the population
averages.
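A small numerical sketch may make this point concrete. The ages below are purely hypothetical (they do not come from the manual): every family is assumed to consist of a head aged 40, a wife aged 38, and children aged 10 and 8. Each of the four possible systematic samples gives a very different mean, yet the average of the four sample means equals the population mean.

```python
from statistics import mean

# Hypothetical ages for head, wife and two children; every family identical.
family = [40, 38, 10, 8]
population = family * 25                  # 100 persons listed family by family

interval = 4                              # 25 percent systematic sample
sample_means = [mean(population[start - 1::interval])
                for start in range(1, interval + 1)]

print(sample_means)                       # one mean per possible random start
print(mean(sample_means), mean(population))
```

The first line printed shows the four sample means (40, 38, 10 and 8), and the second shows that their average and the population mean are both 24.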
Although the example given above is not likely to occur in practice, approximations to this situation
sometimes arise. If there is suspicion of any regularity in the sequence of listing, which could
conform to the sampling interval, systematic sampling should be avoided or modified. For example,
the list could be randomized before systematic selection is used.
6.4.3 Modified systematic sampling
One variant of systematic sampling that could be used when there is some systematic ordering in the
population is to use a different random number within each sampling interval. To illustrate, let us
use the previous example of a 25-percent sample when family members are listed in order--head, wife,
and two children. With a systematic sample, once a random number is selected, this sets the pattern for the
entire sample. As explained above, if the random number is 1, the sample will be the 1st, 5th, 9th, 13th
person, etc. (all heads of families); if the random number is 2, the sample will include the 2nd, 6th,
10th, 14th person, etc. (all wives of heads). To avoid this difficulty, we can select a different random
number within each group of 4 persons, so as to avoid a constant interval between our sample cases.
The selection scheme is indicated below:
Group of four persons    Random number (1 to 4)    Person selected
1st                      3                         3rd
2nd                      1                         5th
3rd                      2                         10th
4th                      1                         13th
5th                      4                         20th
etc.
That is, in the first group, one child is selected because the random number is 3. In the second
group, the husband is selected because the random number is 1 and the husband is the first person in
the group, but the fifth person in the list. In the third group, the second person is selected (the wife),
who is the 10th person in the list, and so forth.
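The scheme can be sketched in Python as shown below. The function names are illustrative assumptions, not part of the manual. The first helper simply converts one random number per group into a position in the list, which reproduces the table above; the second draws the random numbers itself.

```python
import random


def positions_from_draws(draws, interval):
    """Convert one random number (1..interval) per group into list positions."""
    return [group * interval + draw for group, draw in enumerate(draws)]


def modified_systematic_sample(population_size, interval, rng=random):
    """Draw a fresh random number within each group of `interval` units."""
    n_groups = -(-population_size // interval)      # ceiling division
    draws = [rng.randint(1, interval) for _ in range(n_groups)]
    # A draw that points past the end of a short final group selects nothing,
    # which keeps every unit's selection probability at 1/interval.
    return [p for p in positions_from_draws(draws, interval)
            if p <= population_size]


# The random numbers 3, 1, 2, 1, 4 from the table select persons 3, 5, 10, 13, 20.
print(positions_from_draws([3, 1, 2, 1, 4], 4))
```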
The system requires more work than ordinary systematic sampling, but it avoids the possibility of the
patterns indicated above. We do not mean to imply that such patterns as described above usually
exist and that systematic sampling should be avoided. In most cases, systematic sampling produces
very satisfactory results.
6.4.4 Serial number as a sampling source
Frequently, in sampling office files, the records have a serial number. We may take advantage of this
fact to draw the sample; for example, by designating all records whose serial numbers end in 5, 7, or
some other number chosen from a table of random numbers. However, before deciding on this
system, one should make sure that the last digit of the serial number is actually random, and does not
represent a nonrandom arrangement of some kind; if it does, we might obtain only one particular
type of unit in the sample by repeatedly selecting the same last digit. If such a serial number is not
present, frequently one can be assigned at random with little cost, and used for sampling.
6.5 GUIDELINES ON WHEN TO USE DIFFERENT SAMPLING
SCHEMES
6.5.1 When to Use Simple Random Sampling (SRS)
Some situations which suggest the use of SRS are:
1. There are no major cost differences associated with including various classes of sampling
units in the sample.
2. The population is relatively homogeneous with respect to the major characteristics being
estimated.
3. There is no auxiliary information available for the population units.
4. There are no cost savings in surveying units which are close together or other natural
clusters of the population.
5. A sampling frame which lists each population element is available.
6. There is no need to make separate estimates for subdivisions of the population.
It should be noted that none of these reasons on its own is enough to justify the use of SRS.
6.5.2 When to Use Systematic Sampling
There are several reasons for using systematic sampling, but in practice, the main reason usually is:
- to select a SRS quickly (from a randomly ordered list)
This type of systematic sampling is suggested for SRS when:
1. The frame is a record system requiring a manual selection of sample units (e.g., a physical
list, card files, etc.)
2. Sampling units are arranged in random order.
3. Time and resources for selecting the sample are limited.
4. No periodicity is suspected in the data.
Systematic sampling can also be used to provide implicit stratification during sample selection if
sampling units are arranged in a particular order. This type of sampling, however, would not be
SRS.
6.5.3 When to Use Stratification
Some situations which suggest the use of stratified sampling are:
1. Natural or predefined strata of the population exist: e.g., geographic divisions such as
states or provinces, ecological zones that have great socioeconomic impact on the population,
etc.
2. There exist subpopulations of interest for which separate estimates of a given precision are
required.
3. Administrative convenience is a consideration, for example where a national statistical office
has regional offices. Strata could be created so that each regional office can handle the sampling
and the interviewing in its own area.
4. Stratification can provide a reduction in cost.
5. Stratification can provide a reduction in variance. This would occur if
a. The variables of interest are correlated with the variable of stratification.
b. The potential strata are internally homogeneous with respect to the variables of interest.
6. Auxiliary information upon which to base the stratification is available for all population
units.
7. Different sampling strategies are required in different parts of the population.
6.5.4 When to Use Single-Stage Cluster Sampling
Some situations which suggest the use of cluster sampling are:
1. Natural or predefined clusters of the population exist: e.g., Metropolitan Statistical Areas
(MSAs), Enumeration Districts (EDs), Enumeration Areas (EAs), etc.
2. Confining sampling operations to units that are nearby produces large cost and time
savings.
3. No frame is available which lists all population elements but one could be constructed for a
limited number of clusters to list all elements in the cluster.
4. Elements within clusters are heterogeneous with respect to variables of interest.
5. Cluster means of the variables of interest are similar among themselves.
6. Cost savings justify the relative loss in precision.
7. Nonsampling errors can be controlled more effectively (e.g. listing operation can be done
more accurately for a cluster than for the whole population, yielding better coverage).
It is generally recommended that clusters be selected either with probability proportional to size or
with equal probabilities after stratification by size. In addition, it is recommended that larger clusters
be placed in certainty strata so they may all be included in the sample. This is done in order to
control the variance of estimates.
6.5.5 When to Use Multi-Stage Sampling
The situations which suggest the use of a multistage design are the same as for single stage cluster
sampling except that multistage sampling is preferred over single stage sampling when:
1. It is operationally impractical to survey all elements in a cluster, or
2. Only a limited number of sample elements can be handled, and concentrating them in a few
clusters would result in estimates of poor precision. In such a case, it would be more
efficient to spread the sample over more clusters and only subsample each cluster.
Remarks:
The above guidelines for using different sampling schemes are not meant to be rigid or
exhaustive. In practice, there might be:
-
multiple survey objectives that conflict with one another, or
-
survey objectives which conflict with survey resources.
Hence, it is usually necessary to compromise in selecting a design or often to combine designs.
6.6 CONTROLS
After a sample is selected, it is necessary to check the number of cases actually obtained against the
number expected (as calculated by applying the sampling rate to the number of cases in the
universe). Discrepancies may indicate that the sampling procedure was not properly carried out. For
example, forgetting to sample from file drawers in use at the time of sampling, and thus omitting part
of the population, would result in fewer cases than expected. Further checks on whether the sample
shows any unusual features may also help us know whether the sampling was actually performed as
planned.
6.7 USE OF CHECK DATA IN SAMPLING
Very frequently, when a sample has been selected for a study, sample data will be collected and
tabulated for a set of basic items for which there are already available known population totals in
addition to the items of special interest in the survey. Such known population totals are called
"check data" or "independent information." If the sample results for the known items agree closely
with the known population totals, it is sometimes claimed that this coincidence "validates" the
sample and proves it will provide good results for other items.
Actually, this so-called "validation" does not demonstrate that we have a "good" sampling procedure,
or that the sample will yield "good" estimates for the other items in the survey. It is only on the basis
of a random method of selecting the sample that we are able to attach a sampling error to our
statistics, and to evaluate the probability that the estimates will be within specified limits of the true
value: therefore, it is obvious that we cannot rely exclusively on such "validation."
Nevertheless, there are three acceptable uses of check data:
(1) Available check data may be used in improving the method of sampling; for example,
in providing a basis for stratification. (This is the subject of the next two chapters.)
(2) It is possible to calculate the standard errors of the estimates made from the sample
data. If the check data and sample estimates of the same items differ more than might
reasonably be expected from the size of the calculated standard errors, this may
indicate that the sampling procedures may not have been carried out properly, the
sampling frame has coverage errors, or something else may have gone wrong in the
implementation of the survey. Further investigation is needed.
(3) Check data may be used in improving the method of making estimates from the
sample; for example, by adjusting the sample estimate by the ratio of the true value of
the check item to the sample estimate of this check item (using a ratio estimate). We
will discuss this more fully in later chapters.
The above three applications of the use of check data (or independent information) are acceptable,
since we can make statistical inferences when using them.
6.8
SAMPLING WEIGHTS IN SRS
Recall that the sampling weight of a sample unit is equal to the reciprocal of the probability of
selection. In SRS, the probability of selection is (n / N). Therefore, the sampling weight is equal to:
Sampling Weight = 1 / (n / N) = N / n
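As a minimal sketch of how this weight is used, the Python lines below compute the SRS weight and apply it to estimate a population total. The population size and the four sample values are hypothetical and serve only to illustrate the arithmetic; the function name `srs_weight` is likewise an assumption.

```python
def srs_weight(N, n):
    """Sampling weight under SRS: reciprocal of the selection probability n/N."""
    return N / n


# Hypothetical sample of n = 4 values drawn from a population of N = 20 units.
N, sample = 20, [3, 5, 2, 6]
weight = srs_weight(N, len(sample))
estimated_total = sum(weight * y for y in sample)

print(weight)            # 5.0
print(estimated_total)   # 80.0
```

Each sample unit represents N/n = 5 population units, so the weighted sum of the sample values estimates the population total.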
6.9
SELF-WEIGHTING AND NON-SELF-WEIGHTING SAMPLES
A sample is self-weighting if every unit in the sample has the same probability of selection. By their
very nature, SRS samples are self-weighting. However, in practice, most complex designs produce
non-self-weighting samples. For instance, a higher sampling fraction is often used for the stratum that
contains large businesses (sometimes all of them are chosen); in demographic surveys (say, health surveys)
we may oversample special minority groups in order to obtain better estimates with smaller
variances. In addition, almost any self-weighting design becomes non-self-weighting due to
adjustments to the basic weights.
Exercises
6.1
You have a population of 185 persons. Select a systematic sample of 20 persons. List the
numbers assigned to them and describe the procedure you used in the selection.
6.2
Suppose that a city block contains 125 housing units. We wish to select a systematic sample of
10 housing units. Follow the steps we discussed in this chapter to accomplish this.
Chapter 7
STRATIFIED SAMPLING-BASIC THEORY
__________________________________________________________________________
7.1
DESCRIPTION OF THE STRATIFICATION PROCEDURE
In simple random sampling, we do not try to force the sample to be representative of different groups
in the population. The tendency to be representative is inherent in the procedure itself and the
sampling error can be reduced only by increasing the size of sample. However, if something is
known in advance about a population, it may be possible to use this information in stratification and
thus reduce the sampling error. The judgment of experts may be useful here.
Stratified random sampling is a method in which the elements of the population are divided into
groups (strata), and a simple random sample is selected for each group, taking at least one element
from each group (stratum). One element from each group is sufficient to estimate the mean, but two
are needed to estimate its reliability; generally many more than two are needed to make the estimates
sufficiently precise. The process of establishing these groups is called stratification and the groups
are called strata. The strata may reflect regions of a country, densely populated or sparsely populated
areas, various ethnic or other groups.
In stratification we group together elements which are similar, so that the population variance
within stratum h is small; at the same time, it is desirable that the means of the several strata
be as different as possible. The letter h will be used to identify the strata so that if L strata are
created, h will go from 1 to L.
In stratified sampling, the probabilities of selection may be the same from group to group, or they
may be different. It is not necessary that all elements have the same chance of selection, but the
chance of each must be known. Under stratified random sampling all the elements in a particular
stratum have equal chances of being selected. While not every combination of elements is possible,
all of the possible samples (that is, combinations of elements) that might be drawn have the same
chance of occurring.
In stratified sampling, the selection of sampling units, the location and enumeration of the selected
units, the distribution and supervision of fieldwork and, in general, the whole administration of the
survey are greatly simplified. The procedure, however, presupposes the knowledge of the strata sizes,
that is, the total number of sampling units in each stratum as well as the availability of a frame for
selecting a sample from each stratum.
The most important aspect of a good stratification is that it lowers significantly the sampling error of
the estimates if the stratification variable is highly correlated with the variables of interest.
7.2
NOTATION
We use the same notation as for simple random sampling, except that there will be a subscript to
indicate a particular stratum when we refer to information regarding this stratum. Thus, N will
represent the total number of elements in the population, as before; but N1 will be the number in the
first stratum, N2 will be the number in the second stratum, etc. Similarly, n will be the total sample
size; n1 will be the size of the sample in the first stratum, n2 will be the size of the sample in the
second stratum, etc. The subscript h denotes the stratum and i the unit within the stratum. As in the
case of simple random sampling, capital letters refer to population values and lower case letters
denote corresponding sample values. The following notation given in the table will be used.
Measurement                                     For Population    For Sample      Sample Estimate
Total number of elements                        N                 n               --
Number of strata                                L                 L               --
Number of elements in the h th stratum          Nh                nh              --
Total for a certain variable (characteristic)   Y                 y
Total of the variable in the h th stratum       Yh                yh
Average over all strata (population mean)       \(\bar{Y}\)       \(\bar{y}\)     \(\bar{y}_{st}\)
Average for h th stratum (stratum mean)         \(\bar{Y}_h\)     \(\bar{y}_h\)   --
Proportion having attribute                     P                 p               p st
Proportion in the h th stratum                  Ph                ph              --
Population Variance                             S²                --              --
Population variance for the h th stratum        \(S_h^2\)                         --
Variance of an estimated total
Variance of an estimated mean
Value of a specific unit                        \(Y_{hi}\)        \(y_{hi}\)      --
7.2.1
Illustration for a Whole Population
Suppose we have a universe of eight farms with known value of land and buildings as follows:
Farm    Value of land and buildings
 1      $2,026
 2       6,854
 3       1,532
 4       2,180
 5       5,408
 6       9,284
 7       1,438
 8       8,836
Let us compute the average (mean) and the standard deviation of these values. In terms of the
notation above, we would have
N = 8
\(\bar{Y}\) = $4,694.75
S = $3,326.04
Now let us arrange the farms into two strata, so that the groupings of values are as follows:
Stratum 1    Stratum 2
$1,438       $5,408
 1,532        6,854
 2,026        8,836
 2,180        9,284
If we compute the average and standard deviation of each group of four farms separately, we would
have
Stratum 1                    Stratum 2
N1 = 4                       N2 = 4
\(\bar{Y}_1\) = $1,794       \(\bar{Y}_2\) = $7,595.50
S1 = $364.33                 S2 = $1,800.45
7.3
ESTIMATES FROM A STRATIFIED SAMPLE
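The population mean can be expressed in terms of the stratum totals, as follows:
\[ \bar{Y} = \frac{Y}{N} = \frac{1}{N}\sum_{h=1}^{L} Y_h \tag{7.1} \]
where the population total \(Y = \sum_{h=1}^{L} Y_h\). Since each \(Y_h\) can be expressed as
\(N_h \bar{Y}_h\), we may write
\[ \bar{Y} = \frac{1}{N}\sum_{h=1}^{L} N_h \bar{Y}_h \tag{7.2} \]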
Within each stratum, simple random sampling is used. We saw previously that for simple random
sampling, \(\bar{y}\) is an unbiased estimate of \(\bar{Y}\). This suggests that for stratified
sampling an estimate of the population mean can be obtained by substituting, for each stratum mean,
the corresponding estimate from the sample. That is, the mean of the sample elements from the first
stratum gives an estimate of the true mean of the first stratum; the mean of the sample elements in
the second stratum gives us an estimate of the true mean for the second stratum, etc. In symbols,
therefore, the estimate of the population mean from a stratified sample is denoted by
\(\bar{y}_{st}\) (st for stratified) and is given by:
\[ \bar{y}_{st} = \frac{1}{N}\sum_{h=1}^{L} N_h \bar{y}_h \tag{7.3} \]
Another way of expressing the same formula is
\[ \bar{y}_{st} = \frac{1}{N}\sum_{h=1}^{L} \frac{N_h}{n_h}\, y_h \tag{7.4} \]
where yh is the sample total for the hth stratum.
7.3.1
Illustration of estimate of mean
A stratified sample is drawn from a population of 1,000 farms to estimate average expenditure by
farm operators for hired labor. There are three strata--the total number of farms in the first is 300; in
the second, also 300; and in the third, 400. The selected samples have 30, 30, and 40 farms in the
three strata respectively. The average expenditure for the 30 farms in the first stratum is $12.20; for
the 30 farms in the second stratum, $25.60; and for the 40 farms in the third stratum, $48.70. For the
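sample estimate of the average expenditure for all farms in the population we would have
\[ \bar{y}_{st} = \frac{300(12.20) + 300(25.60) + 400(48.70)}{1{,}000} = \frac{30{,}820}{1{,}000} = \$30.82 \]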
7.3.2
Estimate of total
As with simple random sampling, we make an estimate of the population total by multiplying the
estimate of the mean by the total number of elements in the population:
\[ \hat{Y}_{st} = N \bar{y}_{st} = \sum_{h=1}^{L} N_h \bar{y}_h \tag{7.5} \]
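The estimates of the mean and of the total can be checked with a few lines of Python. This is a sketch only: the function names `stratified_mean` and `stratified_total` are illustrative, and the formulas coded are the weighted mean of the stratum sample means (weights Nh/N) and N times that mean, as described in the text. The data are those of the farm expenditure illustration in section 7.3.1.

```python
def stratified_mean(stratum_sizes, stratum_means):
    """Weighted mean of the stratum sample means, with weights N_h / N."""
    N = sum(stratum_sizes)
    weighted_sum = sum(N_h * ybar_h
                       for N_h, ybar_h in zip(stratum_sizes, stratum_means))
    return weighted_sum / N


def stratified_total(stratum_sizes, stratum_means):
    """Estimated population total: N times the stratified mean."""
    return sum(stratum_sizes) * stratified_mean(stratum_sizes, stratum_means)


sizes = [300, 300, 400]            # N_h for the three strata
means = [12.20, 25.60, 48.70]      # sample means in the three strata

print(round(stratified_mean(sizes, means), 2))    # 30.82
print(round(stratified_total(sizes, means), 2))   # 30820.0
```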
7.3.3
Estimate of proportion
To estimate a proportion for the population, the procedure is similar to that for the mean because a
proportion, P, is simply a special case of the mean when the only possible values of \(Y_{hi}\) are 0 and
1. In this case, \(\bar{y}_h = p_h\) for stratified random sampling. The true population proportion P is given by
\[ P = \frac{1}{N}\sum_{h=1}^{L} N_h P_h \]
and it is estimated by
\[ p_{st} = \frac{1}{N}\sum_{h=1}^{L} N_h p_h \tag{7.6} \]
7.4
SAMPLING ERROR OF A STRATIFIED SAMPLE
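The sampling errors of the three types of estimates referred to above are computed by using equation
(7.7) for the mean, equation (7.8) for the total, and equation (7.9) for a proportion:
\[ \sigma_{\bar{y}_{st}} = \sqrt{\sum_{h=1}^{L} \left(\frac{N_h}{N}\right)^2 \left(1 - \frac{n_h}{N_h}\right) \frac{S_h^2}{n_h}} \tag{7.7} \]
where
\[ S_h^2 = \frac{1}{N_h - 1}\sum_{i=1}^{N_h} \left(Y_{hi} - \bar{Y}_h\right)^2 \]
\[ \sigma_{\hat{Y}_{st}} = N \sigma_{\bar{y}_{st}} = \sqrt{\sum_{h=1}^{L} N_h^2 \left(1 - \frac{n_h}{N_h}\right) \frac{S_h^2}{n_h}} \tag{7.8} \]
\[ \sigma_{p_{st}} = \sqrt{\sum_{h=1}^{L} \left(\frac{N_h}{N}\right)^2 \left(1 - \frac{n_h}{N_h}\right) \frac{N_h}{N_h - 1}\, \frac{P_h (1 - P_h)}{n_h}} \tag{7.9} \]
The corresponding formulas for the estimated sampling error for each type of estimate are:
\[ s_{\bar{y}_{st}} = \sqrt{\sum_{h=1}^{L} \left(\frac{N_h}{N}\right)^2 \left(1 - \frac{n_h}{N_h}\right) \frac{s_h^2}{n_h}} \tag{7.10} \]
where the standard deviation of the sample in stratum h is
\[ s_h = \sqrt{\frac{1}{n_h - 1}\sum_{i=1}^{n_h} \left(y_{hi} - \bar{y}_h\right)^2} \]
\[ s_{\hat{Y}_{st}} = N s_{\bar{y}_{st}} \tag{7.11} \]
\[ s_{p_{st}} = \sqrt{\sum_{h=1}^{L} \left(\frac{N_h}{N}\right)^2 \left(1 - \frac{n_h}{N_h}\right) \frac{p_h (1 - p_h)}{n_h - 1}} \tag{7.12} \]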
Similar formulas can be derived for the coefficient of variation by dividing the above expressions by
the value of the item being estimated. Thus, for example:
\[ cv(\bar{y}_{st}) = \frac{s_{\bar{y}_{st}}}{\bar{y}_{st}} \tag{7.13} \]
The formulas for confidence intervals of the population mean and the population total are:
\[ \bar{y}_{st} \pm t\, s_{\bar{y}_{st}} \tag{7.14} \]
\[ \hat{Y}_{st} \pm t\, s_{\hat{Y}_{st}} \tag{7.15} \]
These formulas assume that \(\bar{y}_{st}\) is normally distributed and that \(s_{\bar{y}_{st}}\) is well
determined, so that the multiplier t can be read from tables of the normal distribution (see Appendix I).
If only a few degrees of freedom (less than 30) are provided by each stratum, the t-value should be taken
from the tables of Student's t (see Appendix II) instead of the normal table.
7.4.1
Illustration
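Let us apply equation (7.7) to the case of the eight farms in the illustration in section 7.2.1. Suppose we
took a sample of four farms out of the eight--two from each stratum--and we have computed
\(\bar{y}_{st}\) by equation (7.3). What is the sampling error of \(\bar{y}_{st}\)?
In the two strata, the values would be:
Stratum 1                     Stratum 2
N1 = 4                        N2 = 4
n1 = 2                        n2 = 2
S1 = 364.33                   S2 = 1,800.45
\(S_1^2\) = 132,736.35        \(S_2^2\) = 3,241,620.2
Applying equation (7.7)
\[ \sigma_{\bar{y}_{st}} = \sqrt{\left(\tfrac{4}{8}\right)^2 \left(1 - \tfrac{2}{4}\right) \frac{132{,}736.35}{2} + \left(\tfrac{4}{8}\right)^2 \left(1 - \tfrac{2}{4}\right) \frac{3{,}241{,}620.2}{2}} = \sqrt{210{,}897.28} \approx \$459.24 \]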
It is interesting to compare this sampling error with the corresponding sampling error of the mean for
a simple random sample of four farms. For a simple random sample of four farms, we would have
\[ \sigma_{\bar{y}} = \sqrt{\left(1 - \frac{n}{N}\right) \frac{S^2}{n}} = \sqrt{\left(1 - \tfrac{4}{8}\right) \frac{(3{,}326.04)^2}{4}} \approx \$1{,}175.93 \]
In this example, the sampling error of the stratified sample is much smaller than that of the simple
random sample, less than half. In fact, it would require a sample of seven farms, using simple random
sampling, to achieve the same reliability (that is, as small a sampling error) as we obtained with a
stratified sample of the four farms; with six farms the sampling error would still be about $679.
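The comparison can be reproduced with the short Python sketch below. It is an illustration under stated assumptions rather than the manual's own computation: the function names are hypothetical, the stratified formula coded is the usual one with the finite population correction in each stratum, and `statistics.variance` uses the N - 1 divisor, which matches the values of S quoted in section 7.2.1.

```python
import math
from statistics import variance

# Values of land and buildings for the eight farms, grouped into the two strata.
strata = {
    1: [1438, 1532, 2026, 2180],
    2: [5408, 6854, 8836, 9284],
}
n_h = {1: 2, 2: 2}                      # two sample farms per stratum
population = [v for values in strata.values() for v in values]


def se_srs(values, n):
    """Standard error of the mean for an SRS of size n drawn without replacement."""
    f = n / len(values)
    return math.sqrt((1 - f) * variance(values) / n)


def se_stratified(strata, n_h):
    """Standard error of the stratified mean, with the fpc applied per stratum."""
    N = sum(len(values) for values in strata.values())
    total = 0.0
    for h, values in strata.items():
        W_h = len(values) / N           # stratum weight N_h / N
        f_h = n_h[h] / len(values)      # sampling fraction in stratum h
        total += W_h ** 2 * (1 - f_h) * variance(values) / n_h[h]
    return math.sqrt(total)


print(round(se_stratified(strata, n_h), 2))
print(round(se_srs(population, 4), 2))
```

The first value is roughly 459 and the second roughly 1,176, so the stratified design has less than half the sampling error of simple random sampling with the same sample size.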
7.4.2
Remarks
In actual practice, we usually do not know the true values \(S_h\) and \(P_h\). Instead, we substitute
sample estimates of these values into equations (7.7), (7.8), and (7.9) to obtain equations (7.10),
(7.11) and (7.12), respectively. To make such estimates from a single sample, we would need at least
two elements from each stratum. (In the examples described above, we were able to compute the
standard error for samples having only one element per stratum because we had information on all
elements in the universe.)
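To derive equation (7.7), we do the following:
\[ \bar{y}_{st} = \frac{1}{N}\sum_{h=1}^{L} N_h \bar{y}_h \tag{7.16} \]
Apply the variance operator \(V(\cdot)\) to each side of equation (7.16).
\[ V(\bar{y}_{st}) = V\!\left(\frac{1}{N}\sum_{h=1}^{L} N_h \bar{y}_h\right) \tag{7.17} \]
\[ V(\bar{y}_{st}) = \frac{1}{N^2}\, V\!\left(\sum_{h=1}^{L} N_h \bar{y}_h\right) \tag{7.18} \]
Because the samples are selected independently in the different strata, the variance of the sum is the
sum of the variances:
\[ V(\bar{y}_{st}) = \frac{1}{N^2}\sum_{h=1}^{L} V\!\left(N_h \bar{y}_h\right) \tag{7.19} \]
\[ V(\bar{y}_{st}) = \frac{1}{N^2}\sum_{h=1}^{L} N_h^2\, V(\bar{y}_h) \tag{7.20} \]
But \( V(\bar{y}_h) = \left(1 - \frac{n_h}{N_h}\right) \frac{S_h^2}{n_h} \)
for h = 1, 2, 3, . . . , L. Therefore, we can write (7.20) as:
\[ V(\bar{y}_{st}) = \frac{1}{N^2}\sum_{h=1}^{L} N_h^2 \left(1 - \frac{n_h}{N_h}\right) \frac{S_h^2}{n_h} \tag{7.21} \]
Equation (7.21) is equivalent to equation (7.7).
We will now rewrite equation (7.21) in a different way to make some observations.
\[ V(\bar{y}_{st}) = \sum_{h=1}^{L} W_h^2 \left(1 - \frac{n_h}{N_h}\right) \frac{S_h^2}{n_h}, \qquad W_h = \frac{N_h}{N} \tag{7.22} \]
which can also be written the following way:
\[ V(\bar{y}_{st}) = \sum_{h=1}^{L} \frac{W_h^2 S_h^2}{n_h} - \frac{1}{N}\sum_{h=1}^{L} W_h S_h^2 \tag{7.23} \]
From equation (7.22) we can see that if the fpc = 1, i.e., if \(1 - n_h/N_h = 1\), then equation
(7.22) becomes:
\[ V(\bar{y}_{st}) = \sum_{h=1}^{L} \frac{W_h^2 S_h^2}{n_h} \tag{7.24} \]
Equation (7.23) has two components. The first component is shown in equation (7.24) and it
represents the variance of the mean when sampling is done with replacement, that is, when the fpc = 1.
The second term in equation (7.23) represents the adjustment that one needs to make when sampling
is done without replacement.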
We can also see from equation (7.24) that the variance of the mean is directly proportional to the
strata population variance. That is, the smaller the population variance in the strata, the smaller the
variance of the mean. In other words, the more homogeneous the strata, the smaller the overall
variance of the mean with stratified sampling.
Exercises
7.1
Suppose you have a population of 12 persons whose hourly earnings are as follows:
Person    Hourly Wage
  1       0.85
  2       1.35
  3       0.60
  4       2.20
  5       1.80
  6       3.10
  7       0.90
  8       1.50
  9       1.75
 10       0.75
 11       2.40
 12       2.10

a. What is the average (mean) hourly earnings for this group?
b. What is the sampling error of the mean for a sample of six persons selected as a simple
random sample?
c. Stratify this population into three strata of equal size in the best way to estimate average
earnings. List the persons in each stratum by their hourly earnings.
d. Select a sample of six persons--two from each stratum. Suppose from stratum I we obtain in the
sample the values (0.60) and (0.90); from stratum II we obtain the sample values (1.35) and
(1.50); and from stratum III we get (3.10) and (2.10).
   (a) Estimate the average (mean) hourly earnings for this sample.
   (b) Obtain the sampling error of the estimated mean.
Chapter 8
STRATIFIED SAMPLING-ALLOCATION TO STRATA
______________________________________________________________________________
8.1
THE PROBLEM OF ALLOCATION
The definition of stratified sampling does not specify a particular size of sample in a stratum. The
sample can be selected so as to have the same size in each stratum, or it can be distributed in some
other way. As long as we select at least one element per stratum, the specification for a stratified
sample is satisfied; and with two elements per stratum we can estimate both the mean and its error.
Usually the total sample size is much larger than two elements per stratum. Hence, the question
arises as to what criterion should be used in allocating the total sample among the strata.
Let us return to the earlier example of a population of eight farms in two strata. If we wish to select
a sample of two farms to estimate the mean, we have no choice but to take one farm from each
stratum. Suppose, however, that we wish to select four farms. Then we have a choice in the
allocation of the sample. Would it be better to select two farms from each stratum or take one farm
from one stratum and three farms from the other?
There are two important criteria for determining how the sample should be distributed among the
various strata. The first criterion is convenience; that is, choose a method which is easy to apply and
simple to tabulate. This usually leads to the use of proportionate or proportional (allocation)
stratified sampling. The second criterion is precision: choose a method which will provide the
smallest sampling variance (or sampling error). This leads to the use of optimum allocation.
8.2
PROPORTIONATE STRATIFIED SAMPLING
It is very common in stratified sampling to select the same proportion of units in each stratum. With
this method, to take a 10-percent sample of a given population, we would take a 10-percent sample
from each stratum.
Since the sampling rates in all strata are the same, the number of elements taken for the sample will
vary from stratum to stratum, depending on the size of the stratum. Within each stratum, the sample
size will be proportionate to the total population of the stratum. We can express this mathematically
as follows:
\[ \frac{n_h}{N_h} = \frac{n}{N} \]
or alternatively
\[ n_h = \frac{n}{N}\, N_h \]
For the population characteristics that we are usually interested in (namely, Y and \(\bar{Y}\)), we can
prepare estimates from a proportionate stratified sample as easily as from a simple random sample--in
fact, by using the same formula
\[ \bar{y}_{st} = \frac{1}{n}\sum_{i=1}^{n} y_i \tag{8.1} \]
In this formula, the sum is for all sample elements without regard to strata; since (Nh/nh) is a
constant, and equal to (N/n), equation (7.3) of chapter 7 reduces to this form. We also have
\[ \hat{Y}_{st} = \frac{N}{n}\sum_{i=1}^{n} y_i \tag{8.2} \]
where i in equations (8.1) and (8.2) refers to individual observations.
The simple weighting procedure makes proportionate sampling attractive since results are easy to
tabulate. Different strata do not have to be tabulated separately. All of the sample data can be added
together before application of any factors such as (1/n) or (N/n). A sample which has this feature is
self-weighting. That is, in a self-weighting sample, every individual observation has the same
probability of selection and, consequently, the same weight. The true standard error of the mean
estimated from a proportionate stratified sample is
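\[ \sigma_{\bar{y}_{st}} = \sqrt{\sum_{h=1}^{L} \left(\frac{N_h}{N}\right)^2 \left(1 - \frac{n_h}{N_h}\right) \frac{S_h^2}{n_h}} \tag{8.3} \]
When we substitute \(n_h = nN_h/N\) in equation (8.3), this becomes,
\[ \sigma_{\bar{y}_{st}} = \sqrt{\frac{1}{n}\left(1 - \frac{n}{N}\right) \sum_{h=1}^{L} \frac{N_h}{N}\, S_h^2} \tag{8.4} \]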
(8.5)
Proportional allocation has many advantages:
1. In order to use this allocation procedure we don't need to know the stratum variances (as the
methods we'll discuss later do).
2. Other methods require us to know the costs of sampling units in the different strata, but not
this method.
3. The increase in precision from other more elaborate methods is often not very large.
4. It is efficient for national-level estimates.
However, we will see later on that when there is a very large variation in the stratum variances, the
gain in precision obtained by other methods may outweigh the simplicity of proportional allocation.
Nevertheless, as shown later, this method is widely used in applied sample design.
8.3
OPTIMUM ALLOCATION
Sometimes we have to conduct a survey with a fixed amount of money and we may be faced with the
fact that the cost of sampling units in different strata differs widely. For instance, it is well known
that sampling units in rural areas are generally more expensive to survey than those in urban areas,
because the distances are longer and sometimes sampling units are more difficult to find. The term
optimum allocation refers to the optimum (the most efficient) way of allocating the total sample (n) to the
different strata. The formula is given by:
\[ n_h = n\, \frac{N_h S_h / \sqrt{c_h}}{\sum_{h=1}^{L} N_h S_h / \sqrt{c_h}} \]
where ch is the cost of sampling one unit in stratum h. The above formula is obtained by finding the
values of nh that will minimize \(V(\bar{y}_{st})\) subject to the linear constraint
\(\sum_{h=1}^{L} c_h n_h = C\), where C denotes the total amount available for sampling.
When the costs of sampling in the different strata are the same, the optimum allocation formula is
called Neyman allocation, after Jerzy Neyman (1934), who investigated mathematically the question
of what distribution of the sample among strata would give the smallest possible sampling error. He
found that the answer was to let the sampling rate in each stratum vary according to the amount of
variability in the stratum--in other words, to make the sampling rate in a given stratum proportional
to the standard deviation in that stratum. The number of elements to be sampled from any stratum,
then, would depend not only on the total number of elements in that stratum, but also on the standard
deviation of the characteristic to be measured. For Neyman allocation, the number to be selected
within a stratum is given by the following formula:
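\[ n_h = n\, \frac{N_h S_h}{\sum_{h=1}^{L} N_h S_h} \tag{8.6} \]
With Neyman allocation, the formula for the variance of the mean (after using (8.6) in formula (8.3))
reduces to
\[ \sigma^2_{\bar{y}_{st}} = \frac{1}{n}\left(\sum_{h=1}^{L} \frac{N_h}{N}\, S_h\right)^2 - \frac{1}{N}\sum_{h=1}^{L} \frac{N_h}{N}\, S_h^2 \tag{8.7} \]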
The second term on the right represents the use of the fpc.
As before, the standard error of the total is given by the following formula:
\[ \sigma_{\hat{Y}_{st}} = N \sigma_{\bar{y}_{st}} \tag{8.8} \]
For this type of allocation, it is necessary to know the values of Sh in the universe. If these are not
known in advance, then Sh may be estimated within each stratum, by using the methods described in
Section 5.4.
Note that in formula (8.6), when the Sh are all equal, Neyman allocation becomes proportionate
allocation.
8.3.1
Illustration
Let us compare the standard errors arising from proportionate and optimum allocation in the same
survey. In 1942, a census of lumber production was taken in the United States. In 1943, the survey
was to be repeated, but on a sample basis. Before selecting the sample, mills were grouped into
strata, on the basis of their 1942 production; an analysis of the data produced the information
presented in Table 8.1.
Table 8.1
BASIC DATA FOR DETERMINING OPTIMUM ALLOCATION
(Production figures and standard deviations given in thousands of board feet)

Stratum   1942 annual       Number of    Average production   Standard deviation
          production        mills (Nh)   in stratum           for 1943 (Sh)*
1         5,000 and over       538       11,029.7             9,000
2         1,000 to 4,999     4,756        1,779.6             1,200
3         Under 1,000       30,964          203.8               300
Total                       36,258          571.2             **1,684

*Estimated from 1942 data. **For unstratified sampling.
Now let us select a sample of 1,000 mills. The first question to consider is how to determine the
sample size in each stratum, under either proportionate sampling or optimum allocation sampling.
The second question to consider is the resulting reliability of the two methods. Let us consider first
the matter of the sample size, then the matter of reliability.
8.3.2
Sample Size in Each Stratum
For proportionate allocation, since the sampling rate is 1,000 out of 36,258, this rate is used in each
stratum. The sample sizes, therefore, would be:
n1 = (1,000 / 36,258) x 538 = 15
n2 = (1,000 / 36,258) x 4,756 = 131
n3 = (1,000 / 36,258) x 30,964 = 854.
For optimum allocation, the sample size in each stratum would be determined by the following table.
Table 8.2
SAMPLE SIZE FOR OPTIMUM ALLOCATION

Stratum   Number of    Standard         NhSh         Proportion of   Number in      Sampling
          mills (Nh)   deviation (Sh)                total NhSh      sample (nh)*   rate
1            538       9,000             4,842,000   0.244             244          1/2
2          4,756       1,200             5,707,200   0.288             288          1/16
3         30,964         300             9,289,200   0.468             468          1/66
Total     36,258                        19,838,400   1.000           1,000

*nh = 1,000 x (NhSh / total NhSh)
8.3.3
Standard Errors
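What are the standard errors for these two sample designs? For proportionate allocation, the
standard error of the estimate of the mean is given by equation (8.4):
$$S(\bar{y}_{st}) = \sqrt{\left(1-\frac{n}{N}\right)\frac{1}{n}\sum_h \frac{N_h}{N} S_h^2}$$
For the survey of lumber production,
$$\sum_h \frac{N_h}{N} S_h^2 = \frac{538\,(9{,}000)^2 + 4{,}756\,(1{,}200)^2 + 30{,}964\,(300)^2}{36{,}258} \approx 1{,}467{,}632$$
and
$$S(\bar{y}_{st}) = \sqrt{\left(1-\frac{1{,}000}{36{,}258}\right)\frac{1{,}467{,}632}{1{,}000}} \approx 37.8$$
thousand board feet.
For optimum allocation, the corresponding standard error is given by equation (8.7):
$$S(\bar{y}_{st}) = \sqrt{\frac{(19{,}838{,}400)^2}{1{,}000\,(36{,}258)^2} - \frac{1{,}467{,}632}{36{,}258}} \approx \sqrt{299.4 - 40.5} \approx 16.1$$
To complete the analysis, one may compare these results with those obtained if we had not stratified
the mills, but had taken a simple random sample of 1,000 mills from the universe. In this case, the
standard error is given by:
$$S(\bar{y}) = \sqrt{\left(1-\frac{n}{N}\right)\frac{S^2}{n}} = \sqrt{\left(1-\frac{1{,}000}{36{,}258}\right)\frac{(1{,}684)^2}{1{,}000}} \approx 52.5$$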
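The three comparisons in this illustration can be checked with a short script. The following is a minimal sketch, not part of the original manual: the function names are illustrative, and the variance expressions are the usual textbook forms assumed here for equations (8.4) and (8.7) and for simple random sampling.

```python
import math

# Table 8.1 figures (thousands of board feet).
N_h = [538, 4756, 30964]        # number of mills in each stratum
S_h = [9000.0, 1200.0, 300.0]   # stratum standard deviations
S_ALL = 1684.0                  # standard deviation without stratification
n = 1000                        # total sample size
N = sum(N_h)


def proportionate_allocation(N_h, n):
    """n_h = n * N_h / N (same sampling rate in every stratum)."""
    total = sum(N_h)
    return [n * Nh / total for Nh in N_h]


def neyman_allocation(N_h, S_h, n):
    """n_h = n * N_h S_h / sum(N_h S_h)."""
    weights = [Nh * Sh for Nh, Sh in zip(N_h, S_h)]
    total = sum(weights)
    return [n * w / total for w in weights]


def se_proportionate(N_h, S_h, n):
    """Standard error of the mean under proportionate allocation (with fpc)."""
    N = sum(N_h)
    within = sum(Nh * Sh ** 2 for Nh, Sh in zip(N_h, S_h)) / N
    return math.sqrt((1 - n / N) * within / n)


def se_neyman(N_h, S_h, n):
    """Standard error of the mean under Neyman allocation (with fpc term)."""
    N = sum(N_h)
    first = sum(Nh * Sh for Nh, Sh in zip(N_h, S_h)) ** 2 / (n * N ** 2)
    fpc_term = sum(Nh * Sh ** 2 for Nh, Sh in zip(N_h, S_h)) / N ** 2
    return math.sqrt(first - fpc_term)


def se_simple_random(S, N, n):
    """Standard error of the mean for a simple random sample (with fpc)."""
    return math.sqrt((1 - n / N) * S ** 2 / n)


prop_sizes = [round(x) for x in proportionate_allocation(N_h, n)]
opt_sizes = [round(x) for x in neyman_allocation(N_h, S_h, n)]
se_prop = se_proportionate(N_h, S_h, n)
se_opt = se_neyman(N_h, S_h, n)
se_srs = se_simple_random(S_ALL, N, n)

print("proportionate sizes:", prop_sizes)
print("optimum sizes:      ", opt_sizes)
print("standard errors:    ", round(se_prop, 1), round(se_opt, 1), round(se_srs, 1))
```

Run on the Table 8.1 data, the script gives stratum sample sizes and standard errors that agree with the figures quoted in sections 8.3.2 and 8.4.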
8.4
COMPARISON OF SAMPLING ERRORS WITH DIFFERENT
SAMPLING METHODS
Examining the results of the sample designs above, we see that optimum allocation gave us a standard
error of 16.1 thousand board feet, considerably smaller than that under proportionate sampling, which
was 37.8; we see also that the sampling error under proportionate sampling was smaller than that
under simple random sampling, which was 52.5. Putting the results another way, it would require a
proportionate sample more than 5 times as large as an optimum allocation sample to achieve the same
reliability. Simple random sampling would require a sample 10 times as large. The efficiency of
optimum allocation results from the fact that it provides for more intensive sampling in strata having
large standard deviations, which can be expected to contribute more heavily to the total sampling
error.
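The example in section 8.3 above illustrates a general result which can be demonstrated
mathematically. The sampling errors of the three types of designs are approximately related in the
following way (if the sampling rates are small enough so that the finite population correction
factors can be ignored):
$$V_{ran} = V_{prop} + \frac{1}{n}\sum_h \frac{N_h}{N}\left(\bar{Y}_h - \bar{Y}\right)^2 \qquad (8.9)$$
$$V_{prop} = V_{opt} + \frac{1}{n}\sum_h \frac{N_h}{N}\left(S_h - \bar{S}\right)^2 \qquad (8.10)$$
where $\bar{Y}_h$ and $\bar{Y}$ are the stratum means and the overall mean,
$\bar{S} = \sum_h N_h S_h / N$ is a weighted average of the values of Sh, and $V_{ran}$, $V_{opt}$
and $V_{prop}$ are, respectively, the variances of the estimated means based on simple random
sampling, optimum and proportionate sampling.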
An examination of this formula shows that sampling errors obtained with optimum allocation will be
at least as small, and usually smaller, than those obtained with proportionate stratified sampling.
Furthermore, the errors obtained with either of these methods will be at least as small, and generally
smaller, than those obtained with simple random sampling. (There are a few rare cases, which almost
never occur in practice, in which this is not true. When the sample is very small and the stratification
is completely ineffective, neither proportionate sampling nor optimum allocation may show a gain
over simple random sampling. For all practical purposes, this possibility can be ignored.)
Consider the conditions under which important differences result from the three methods. When we
compare proportionate stratified sampling with simple random sampling, it can be shown that the gain
in reliability depends on the amount by which the means of the strata vary; the greater the variation
between the means (in other words, the greater the differences among the strata), the more the
reduction in the standard error arising from the use of proportionate sampling. On the other hand, if
the variance between stratum means is fairly small compared to the total variance, not much will be
gained by stratification. As a result, stratification is usually less important in dealing with proportions
than with measured items (or with aggregates or quantities). For example, it would be of much
greater help in trying to estimate the average expenditure of farmers for hired labor than in estimating
the proportion of farmers who hire labor. Even for measured items the gains would be slight unless
the strata are established so that the differences between the means are sizable (as was the case in the
example of lumber mills). For example, in conducting a survey to measure personal income, it would
probably not pay to establish separate strata for different professional groups--for example, doctors,
lawyers, etc. It probably would be useful, however, to set up separate strata for broader groups--laborers, businessmen, professionals, etc. Since proportionate sampling is nearly always better than
simple random sampling, stratification is recommended whenever it can be accomplished with little
additional work.
Comparing optimum allocation with proportionate allocation, we see that if the standard deviations in
all strata are the same, the two methods are identical. The greater the differences between the
standard deviations in the strata, the greater the reduction in sampling error to be expected from
optimum allocation. Unless the range among the standard deviations is greater than 2 or 3 to 1, the
gains of optimum allocation are so small that they are probably not worth the extra complications in
tabulation. With larger variations in standard deviations, the gains are appreciable and optimum
allocation is advisable. In the example of lumber mills, the standard deviation for stratum 1 was 30
times as large as that for stratum 3.
We need to know the Sh for each stratum either (a) to apply optimum allocation or (b) to estimate the
errors of proportionate stratified samples. Of course, in practice, we never really know each Sh and
must estimate it. Two questions arise: (a) How is the accuracy of the sample affected by the errors
introduced by estimating Sh instead of knowing the true value? (b) What methods can be used to
estimate these quantities?
In answer to the first question, if our estimates of the standard deviation are fairly reasonable (for
example, accurate to within 30% or 40%) we will obtain almost all of the gains of optimum
allocation. The reason for this is that the sampling error does not increase very rapidly as the
allocation departs from the optimum within fairly broad limits. (It should be noted that poor guesses
of the values of Sh do not introduce any biases in the result; they only increase the sampling errors.)
However, if the estimates of Sh are very unreliable, the "optimum allocation" may have a larger
variance than proportionate allocation. In this case, it is safer to use proportionate allocation.
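The point about moderately wrong guesses of Sh can be illustrated numerically with the lumber mill data of Table 8.1. The sketch below is an illustration only: the "guessed" standard deviations (40% too low in stratum 1, 40% too high in stratum 2) are hypothetical values chosen for the example, and the variance expression is the usual stratified-sampling form for an arbitrary allocation.

```python
import math

N_h = [538, 4756, 30964]           # mills per stratum (Table 8.1)
S_true = [9000.0, 1200.0, 300.0]   # standard deviations treated as the true values
n = 1000


def allocate(N_h, S_guess, n):
    """Allocate n in proportion to N_h times the guessed S_h."""
    weights = [Nh * Sg for Nh, Sg in zip(N_h, S_guess)]
    total = sum(weights)
    return [n * w / total for w in weights]


def se_mean(N_h, S_h, n_h):
    """Standard error of the stratified mean for any allocation n_h (with fpc)."""
    N = sum(N_h)
    var = sum((Nh / N) ** 2 * Sh ** 2 * (1 / nh - 1 / Nh)
              for Nh, Sh, nh in zip(N_h, S_h, n_h))
    return math.sqrt(var)


# Hypothetical guesses: stratum 1 understated by 40%, stratum 2 overstated by 40%.
S_guess = [0.6 * 9000.0, 1.4 * 1200.0, 300.0]

se_opt = se_mean(N_h, S_true, allocate(N_h, S_true, n))              # true S_h known
se_guess = se_mean(N_h, S_true, allocate(N_h, S_guess, n))           # guessed S_h
se_prop = se_mean(N_h, S_true, allocate(N_h, [1.0, 1.0, 1.0], n))    # proportionate

print(round(se_opt, 1), round(se_guess, 1), round(se_prop, 1))
```

With these particular guesses the standard error rises only from about 16 to about 17 thousand board feet, still far below the proportionate-allocation figure, which is the behavior described above.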
In regard to the second question, we can use the methods for estimating the standard deviations
described previously (Section 5.4 of Chapter 5). One additional method that is sometimes used is to
assume that the standard deviations for the strata are proportional to the average values within the
strata; that is, assume the same relative standard deviation in each stratum. (Note that for optimum
allocation, it is not necessary to know the absolute values of the standard deviations; it is only
necessary to know their values relative to each other.) This assumption will frequently give results
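reasonably close to the optimum. In the case of the lumber mills discussed previously, allocating the
sample in proportion to Nh times the average production in each stratum would place roughly 29% of
the 1,000 sample mills in stratum 1, 41% in stratum 2, and 30% in stratum 3.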
It can be seen that this allocation is much closer to optimum allocation than is proportionate
allocation. In fact, if the standard error of this allocation is computed, it turns out to be 17.3. This is
not quite as good as the 16.1 for optimum allocation, but it is far superior to the 37.8 obtained with
proportionate sampling.
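The allocation based on equal relative standard deviations can be reproduced from Table 8.1. This is a sketch under stated assumptions: the sample is allocated in proportion to Nh times the stratum average production, and the standard error is then evaluated with the Sh of Table 8.1 using the usual stratified variance formula with the fpc.

```python
import math

N_h = [538, 4756, 30964]            # mills per stratum
mean_h = [11029.7, 1779.6, 203.8]   # average production per mill
S_h = [9000.0, 1200.0, 300.0]       # standard deviations from Table 8.1
n = 1000

# Assume S_h is proportional to the stratum mean, so n_h is proportional to N_h * mean_h.
weights = [Nh * m for Nh, m in zip(N_h, mean_h)]
n_h = [n * w / sum(weights) for w in weights]


def se_mean(N_h, S_h, n_h):
    """Standard error of the stratified mean for any allocation n_h (with fpc)."""
    N = sum(N_h)
    var = sum((Nh / N) ** 2 * Sh ** 2 * (1 / nh - 1 / Nh)
              for Nh, Sh, nh in zip(N_h, S_h, n_h))
    return math.sqrt(var)


se = se_mean(N_h, S_h, n_h)
shares = [round(100 * x / n) for x in n_h]   # percent of the sample per stratum

print("sample sizes:", [round(x) for x in n_h])
print("percent of sample:", shares)
print("standard error:", round(se, 1))
```

The resulting standard error is close to the 17.3 quoted in the text.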
8.5
OPTIMUM ALLOCATION WITH VARIABLE COSTS
The discussion of optimum allocation thus far has been in terms of getting the most reliable results for
a given total sample size. It frequently happens that the costs of obtaining information vary
substantially from stratum to stratum. To give an example, let us suppose that families have been
stratified by urban and rural residence; furthermore, suppose that the cost of conducting a rural
interview is five times as great as that of an urban interview. It would be wise to concentrate more of
the sample in the cheaper stratum. Another example would be a sample survey of business firms; we
may mail questionnaires to small companies and visit large ones personally, when there are large
differences in unit costs.
A more general approach than the one which is described in section 4 above is to consider the
optimum allocation for a fixed cost, rather than for a fixed sample size. In other words, we would
like to allocate the sample among strata in such a way as to achieve the lowest standard error with a
fixed budget.
For this we need a cost function, which is a mathematical formulation expressing the cost of taking
the survey in terms of the sample sizes, nh. Suppose the average cost for a single questionnaire in the
hth stratum is called Ch. Thus C1 is the cost per questionnaire in the first stratum, C2 is the cost in the
second stratum, etc. Ch represents the total cost of a questionnaire in the hth stratum, including the
cost of interviewing, coding, data entry, etc. (There may also be an overhead cost for the survey
which does not depend on the size of the sample, but it is not necessary to consider this in the cost
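function.) The total cost of the survey which can be affected by the sample size is
$$C = \sum_h C_h n_h$$
For a fixed cost C, the optimum allocation of the sample turns out to be
$$n_h = n \, \frac{N_h S_h / \sqrt{C_h}}{\sum_h N_h S_h / \sqrt{C_h}} \qquad (8.11)$$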
Note: To use this formula, n must first be calculated. Note that nh is a function of the Ch's, Sh's, and Nh's. See Sample Survey
Methods and Theory, Volume I: Methods and Applications, by Hansen, M.H., Hurwitz, W.N., and Madow, W.G. New York,
Wiley and Sons, 1953, p. 221.
That is, nh is directly proportional to Nh and to Sh, and inversely proportional to the square root
of Ch. Formula (8.11) leads to several rules. In a given stratum, we would take a larger sample under
the following conditions:
(1) If the stratum is larger than the average stratum.
(2) If the stratum is more variable internally than the average stratum.
(3) If the cost of collection and processing is cheaper than in the average stratum.
In regard to the third point, the cost per stratum (Ch) enters into the formula in the form of a square
root. This tends to reduce the effect of the differences in unit cost. Unless the costs vary by a factor
of at least 2 to 1, using the formula above will give results not very much different from the simpler
Neyman allocation given in equation (8.6).
In equation (8.11), we do not yet know the value of n. If cost is fixed, substitute the value of nh from
(8.11) in
$$C = \sum_h C_h n_h$$
and solve for n. This gives
$$n = \frac{C \sum_h N_h S_h / \sqrt{C_h}}{\sum_h N_h S_h \sqrt{C_h}} \qquad (8.12)$$
If, however, an estimate with a specified variance V is required, n is given by
$$n = \frac{\left(\sum_h N_h S_h \sqrt{C_h}\right)\left(\sum_h N_h S_h / \sqrt{C_h}\right)}{N^2 V + \sum_h N_h S_h^2} \qquad (8.13)$$
Note that in case Ch = c, that is, if the cost per unit is the same in all strata, then the cost becomes
C = cn and also equation (8.11) reduces to equation (8.6), which is the formula for
Neyman allocation. That is, optimum allocation for fixed cost reduces to optimum allocation for
fixed sample size.
8.5.1
Illustration
Suppose a sampler proposes to take a stratified random sample. He expects that his field costs will be
of the form
$$C = \sum_h C_h n_h$$
His advance estimates of relevant quantities for the two strata are as follows:

Stratum 1      Stratum 2
N1 = 1,056     N2 = 1,584
S1 = 10        S2 = 20
C1 = $4        C2 = $9

(a) Find the sample size required under optimum allocation, to make the variance of the estimated
mean equal to 1. Ignore the fpc.
(b) Determine the sample size for each stratum (i.e., the allocation of the total sample size n to
each of the two strata).
(c) How much will the total field cost be (excluding overhead costs)?
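Solution of (a)
For optimum allocation, the formula for n is given by equation (8.13), where V is the required
variance of the estimated mean. Recall that E = k S($\bar{y}_{st}$). In our example, E = 2 since k is
taken to be equal to 2 (it should be 1.96 to be exact). Therefore, n = 239.
Solution of (b)
The sample size in each stratum is given by equation (8.11). For the sample size in the first stratum:
$$n_1 = 239 \times \frac{(1{,}056)(10)/\sqrt{4}}{(1{,}056)(10)/\sqrt{4} + (1{,}584)(20)/\sqrt{9}} = 239 \times \frac{5{,}280}{15{,}840} \approx 80$$
Similarly, n2 = 159.
Solution of (c)
The total field cost is given by: C = C1 n1 + C2 n2 = 4 (80) + 9 (159) = $1,751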
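The arithmetic of this illustration can be retraced in code. The sketch below assumes the usual textbook form of equation (8.13), n = (sum Nh Sh sqrt(Ch)) (sum Nh Sh / sqrt(Ch)) / (N^2 V + sum Nh Sh^2), with V = (E/k)^2 = 1; the function names and the `keep_fpc_term` switch are illustrative. Under these assumptions, the figures quoted in the solution (n1 = 80, n2 = 159, C = $1,751) are obtained when the term sum Nh Sh^2 is kept in the denominator; dropping that term gives a somewhat larger n.

```python
import math

N_h = [1056, 1584]    # stratum sizes
S_h = [10.0, 20.0]    # stratum standard deviations
C_h = [4.0, 9.0]      # cost per questionnaire, in dollars
V = 1.0               # target variance, (E / k) ** 2 with E = 2 and k = 2


def total_sample_size(N_h, S_h, C_h, V, keep_fpc_term=True):
    """Total n for a specified variance V under cost-optimum allocation."""
    N = sum(N_h)
    a = sum(Nh * Sh * math.sqrt(Ch) for Nh, Sh, Ch in zip(N_h, S_h, C_h))
    b = sum(Nh * Sh / math.sqrt(Ch) for Nh, Sh, Ch in zip(N_h, S_h, C_h))
    denominator = N ** 2 * V
    if keep_fpc_term:
        denominator += sum(Nh * Sh ** 2 for Nh, Sh in zip(N_h, S_h))
    return a * b / denominator


def cost_optimum_allocation(N_h, S_h, C_h, n):
    """n_h proportional to N_h S_h / sqrt(C_h)."""
    weights = [Nh * Sh / math.sqrt(Ch) for Nh, Sh, Ch in zip(N_h, S_h, C_h)]
    total = sum(weights)
    return [n * w / total for w in weights]


n_with = round(total_sample_size(N_h, S_h, C_h, V))
n_h = [round(x) for x in cost_optimum_allocation(N_h, S_h, C_h, n_with)]
cost = sum(c * k for c, k in zip(C_h, n_h))
n_without = round(total_sample_size(N_h, S_h, C_h, V, keep_fpc_term=False))

print("n (fpc term kept):", n_with, "allocation:", n_h, "field cost:", cost)
print("n (fpc term dropped):", n_without)
```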
8.6 OPTIMUM ALLOCATION FOR SEVERAL ITEMS
The formula for the optimum allocation of the sample (equation 8.6 or 8.11) is necessarily computed
for a single characteristic or variable, Y. If it is desired to obtain the most favorable sample allocation
for several characteristics, some kind of compromise must be made. Some alternatives are:
(1) Determine the most important item (or group of highly correlated items) and allocate the
sample to get the best estimate for this item.
(2) Follow the procedure in (1) and increase the size of the sample in some strata to provide
adequate coverage of other important items.
(3) Set up a function which assigns a weight to each item according to its importance; use this
function in the allocation to prevent poor sample estimates for the most important
characteristics.
Optimum allocation is most effective for characteristics which vary widely among the individual units,
such as amount of personal income, number of board feet produced by a sawmill, kilos of maize
harvested on a farm, etc.
In sampling for attributes, however, such as the proportion of the population in a class (for example,
in the income class $1,000 - $1,999), proportionate sampling may be the best allocation. It has the
added advantage of being self-weighting.
8.7 STRATIFIED SAMPLING FOR PROPORTIONS
Before concluding this chapter, some comments will be made on the problem of sample allocation
when the object is to estimate a population proportion P. From equation (7.9) of Chapter 7, we have
for stratified random sampling,
(8.14)
with proportional allocation,
(8.15)
Ignoring the fpc,
(8.16)
For the sample estimate of the variance, substitute
for the unknown
in any of the
formulas above.
If the optimal allocation can be used, nh will be chosen proportional to $N_h\sqrt{P_h Q_h}$, where
Qh = 1 - Ph. This allocation will differ substantially from proportional allocation only if the
quantities $\sqrt{P_h Q_h}$ differ considerably from stratum to stratum. For example, let the Ph lie
between 0.3 and 0.7, in which case $\sqrt{P_h Q_h}$ will lie between 0.46 and 0.50. In this situation
the optimum allocation will not be preferred to proportional allocation when the simplicity of the
computations involved is another factor to be taken into account.
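We can choose nh in order to minimize the variance.
Minimum variance for fixed total sample size.
$$n_h \propto N_h \sqrt{P_h Q_h}$$
where $\propto$ represents "proportional to".
Thus,
$$n_h = n \, \frac{N_h \sqrt{P_h Q_h}}{\sum_h N_h \sqrt{P_h Q_h}} \qquad (8.17)$$
Minimum variance for fixed cost.
$$n_h = n \, \frac{N_h \sqrt{P_h Q_h / C_h}}{\sum_h N_h \sqrt{P_h Q_h / C_h}} \qquad (8.18)$$
where cost = $\sum_h C_h n_h$. The value of n is found by substituting $S_h = \sqrt{P_h Q_h}$
in equation (8.12) or (8.13).
8.7.1
Illustration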
In a firm, 62% of the employees are skilled or unskilled males, 31% are clerical females, and 7% are
supervisory. A sample of 400 employees is taken from a total of 7,000 employees. Based on the
sample, the firm wishes to estimate the proportion that uses certain recreational facilities. Rough
guesses are that the facilities are used by 40 to 50% of the males, 20 to 30% of the females, and 5 to
10 % of the supervisors. How would you allocate the sample among the three groups ? What would
the standard error of the estimated proportion pst be? Ignore the fpc.
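We have
N = 7,000, n = 400
N1 = 4,340, N2 = 2,170 and N3 = 490
We guess P1 = 45%, P2 = 25% and P3 = 7.5% as a compromise.
Using equation (8.17), we can allocate the total sample size (n = 400) to the different strata, as
follows:
$$n_1 = 400 \times \frac{4{,}340\sqrt{(0.45)(0.55)}}{4{,}340\sqrt{(0.45)(0.55)} + 2{,}170\sqrt{(0.25)(0.75)} + 490\sqrt{(0.075)(0.925)}} = 400 \times \frac{2{,}159.1}{3{,}227.8} \approx 268$$
Similarly,
n2 = 116 and n3 = 16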
The standard error is given by the equation (8.16):
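The allocation and the resulting standard error can be computed as in the sketch below. It assumes that the variance ignoring the fpc takes the usual form V(pst) = sum (Nh/N)^2 Ph Qh / nh; if equation (8.16) in the original is written differently, the standard error line should be adapted accordingly.

```python
import math

N_h = [4340, 2170, 490]      # males, clerical females, supervisory
P_h = [0.45, 0.25, 0.075]    # guessed proportions using the facilities
n = 400
N = sum(N_h)

# Allocation for fixed total sample size: n_h proportional to N_h * sqrt(P_h Q_h).
weights = [Nh * math.sqrt(P * (1 - P)) for Nh, P in zip(N_h, P_h)]
n_h = [round(n * w / sum(weights)) for w in weights]

# Assumed no-fpc variance of the stratified proportion.
variance = sum((Nh / N) ** 2 * P * (1 - P) / nh
               for Nh, P, nh in zip(N_h, P_h, n_h))
standard_error = math.sqrt(variance)

print("allocation:", n_h)
print("standard error of pst:", round(standard_error, 4))
```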
8.8 DETERMINATION OF SAMPLE SIZE n
In simple random sampling, we saw that the determination of n depended on the sampling variance of
the estimator. In a similar way, for stratified sampling, we need to know the formulas for the
sampling variances of the different methods of allocation in order to determine n for each one of these
methods. Let's summarize the methods of allocation.
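1. Equal samples from each stratum.
2. Proportionate allocation.
3. Optimum allocation: fixed budget, varying sampling costs among strata.
4. Neyman allocation: fixed sample size, equal sampling costs among strata.
We also saw that the stratum sample sizes nh for these methods of allocation were given by:
$$n_h = \frac{n}{L} \qquad (8.19) \quad \text{(equal samples)}$$
$$n_h = n \, \frac{N_h}{N} \qquad (8.20) \quad \text{(proportionate)}$$
$$n_h = n \, \frac{N_h S_h / \sqrt{C_h}}{\sum_h N_h S_h / \sqrt{C_h}} \qquad (8.21) \quad \text{(optimum)}$$
$$n_h = n \, \frac{N_h S_h}{\sum_h N_h S_h} \qquad (8.22) \quad \text{(Neyman)}$$
where L is the number of strata.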
To determine the sample size we need to know the variances of these methods. So, we start with the
formula for the variance of a mean when using stratified random sampling. Recall that the formula is
given by:
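$$V(\bar{y}_{st}) = \frac{1}{N^2}\sum_h N_h^2\left(1-\frac{n_h}{N_h}\right)\frac{S_h^2}{n_h} \qquad (8.23)$$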
Now we substitute the different values of nh into formula (8.23).
After we do this, we get:
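$$V(\bar{y}_{st}) = \frac{L}{nN^2}\sum_h N_h^2 S_h^2 - \frac{1}{N^2}\sum_h N_h S_h^2 \qquad (8.24) \quad \text{(equal samples)}$$
$$V(\bar{y}_{st}) = \frac{1}{nN}\sum_h N_h S_h^2 - \frac{1}{N^2}\sum_h N_h S_h^2 \qquad (8.25) \quad \text{(proportionate)}$$
$$V(\bar{y}_{st}) = \frac{1}{nN^2}\left(\sum_h N_h S_h \sqrt{C_h}\right)\left(\sum_h \frac{N_h S_h}{\sqrt{C_h}}\right) - \frac{1}{N^2}\sum_h N_h S_h^2 \qquad (8.26) \quad \text{(optimum)}$$
$$V(\bar{y}_{st}) = \frac{1}{nN^2}\left(\sum_h N_h S_h\right)^2 - \frac{1}{N^2}\sum_h N_h S_h^2 \qquad (8.27) \quad \text{(Neyman)}$$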
Now, let's see how to determine the sample size n to estimate the mean with an error of estimation E.
The sample size is directly related to the error we are willing to tolerate (or the precision we are
required to obtain) in our estimates. As before, we define the error the following way:
Error of estimation = E = k S($\bar{y}_{st}$)
where k is the level of reliability. So, given the precision E that we need to obtain and the level of
reliability k, we can write:
$$V(\bar{y}_{st}) = \frac{E^2}{k^2} = B^2 \qquad (8.28)$$
We know that as n increases, the variance of the estimate becomes smaller. Therefore, we need to
find the sample size n that will give us a variance equal to B2.
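Let's try to solve for n in equation (8.24), that is, when we have equal samples.
$$B^2 = \frac{L}{nN^2}\sum_h N_h^2 S_h^2 - \frac{1}{N^2}\sum_h N_h S_h^2 \qquad (8.29)$$
Multiply each side of equation (8.29) by N^2 and leave the term which contains n on one side of the
equation. After we do this, we obtain:
$$N^2 B^2 + \sum_h N_h S_h^2 = \frac{L}{n}\sum_h N_h^2 S_h^2 \qquad (8.30)$$
When we solve for n, we obtain:
$$n = \frac{L \sum_h N_h^2 S_h^2}{N^2 B^2 + \sum_h N_h S_h^2} \qquad (8.31)$$
Now, when (nh/Nh) is very small (negligible), the fpc = 1 and we may omit from the denominator of
equation (8.31) the term $\sum_h N_h S_h^2$.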
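Applying a similar procedure to equations (8.25), (8.26), and (8.27), we obtain the sample size n
given by the following formulas:
$$n = \frac{N \sum_h N_h S_h^2}{N^2 B^2 + \sum_h N_h S_h^2} \qquad (8.32) \quad \text{(proportionate)}$$
$$n = \frac{\left(\sum_h N_h S_h \sqrt{C_h}\right)\left(\sum_h N_h S_h / \sqrt{C_h}\right)}{N^2 B^2 + \sum_h N_h S_h^2} \qquad (8.33) \quad \text{(optimum)}$$
$$n = \frac{\left(\sum_h N_h S_h\right)^2}{N^2 B^2 + \sum_h N_h S_h^2} \qquad (8.34) \quad \text{(Neyman)}$$
As before, when the fpc = 1, the denominator in equations (8.32), (8.33) and (8.34) only contains the
term N^2 B^2. Another important point to mention is that all the formulas for n have been given in terms
of the stratum population variances (Sh^2). In practice, we don't know these values and they have to be
estimated by means of a sample or from other sources.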
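The four sample-size formulas can be collected in one helper. The sketch below assumes the forms obtained by substituting each allocation into the stratified variance formula and solving for n, with B^2 = (E/k)^2; the function name `sample_size` and its arguments are illustrative. As a consistency check, it uses the lumber mill figures of Table 8.1: the variance that Neyman allocation gives with n = 1,000 is fed back in as B^2, and the formula should return n = 1,000 again.

```python
import math


def sample_size(N_h, S_h, B, method, C_h=None):
    """Total sample size n giving variance B**2 for the stratified mean.

    method: "equal", "proportionate", "optimum" (needs C_h) or "neyman".
    The term sum(N_h * S_h**2) in the denominator is the fpc contribution.
    """
    N = sum(N_h)
    L = len(N_h)
    fpc_term = sum(Nh * Sh ** 2 for Nh, Sh in zip(N_h, S_h))
    if method == "equal":
        numerator = L * sum(Nh ** 2 * Sh ** 2 for Nh, Sh in zip(N_h, S_h))
    elif method == "proportionate":
        numerator = N * fpc_term
    elif method == "neyman":
        numerator = sum(Nh * Sh for Nh, Sh in zip(N_h, S_h)) ** 2
    elif method == "optimum":
        if C_h is None:
            raise ValueError("optimum allocation needs the unit costs C_h")
        a = sum(Nh * Sh * math.sqrt(Ch) for Nh, Sh, Ch in zip(N_h, S_h, C_h))
        b = sum(Nh * Sh / math.sqrt(Ch) for Nh, Sh, Ch in zip(N_h, S_h, C_h))
        numerator = a * b
    else:
        raise ValueError("unknown method: " + method)
    return numerator / (N ** 2 * B ** 2 + fpc_term)


# Lumber mill data (Table 8.1).
N_h = [538, 4756, 30964]
S_h = [9000.0, 1200.0, 300.0]
N = sum(N_h)

# Variance of the mean under Neyman allocation with n = 1,000.
B2 = (sum(Nh * Sh for Nh, Sh in zip(N_h, S_h)) ** 2 / (1000 * N ** 2)
      - sum(Nh * Sh ** 2 for Nh, Sh in zip(N_h, S_h)) / N ** 2)
B = math.sqrt(B2)

n_neyman = sample_size(N_h, S_h, B, "neyman")
n_prop = sample_size(N_h, S_h, B, "proportionate")
n_equal = sample_size(N_h, S_h, B, "equal")
n_opt_equal_cost = sample_size(N_h, S_h, B, "optimum", C_h=[1.0, 1.0, 1.0])

print(round(n_neyman), round(n_prop), round(n_equal), round(n_opt_equal_cost))
```

With equal unit costs the "optimum" formula coincides with the Neyman formula, and for the same target variance Neyman allocation needs no more units than proportionate or equal allocation.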
Stratified Random Sampling
1.
A chain of department stores is interested in estimating the proportion of accounts receivable that
are delinquent. The chain consists of four stores. To reduce the cost of sampling, stratified
random sampling is used with each store as a stratum. Since no information on population
proportions is available before sampling, proportional allocation is used. From the table given
below, estimate P, the proportion of delinquent accounts for the chain, find its sampling error and
calculate the coefficient of variation of the estimate.
Stratum   Number of accounts   Sample size   Sample proportion of
          receivable                         delinquent accounts
I         N1 = 65              n1 = 14
II        N2 = 42              n2 = 9
III       N3 = 93              n3 = 21
IV        N4 = 25              n4 = 6
Answer: Proportion = .3004; Sampling Error = 0.057975451;
2.
A corporation desires to estimate the total number of man-hours lost, for a given month, because
of accidents among all employees. Since laborers, technicians, and administrators have different
accident rates, it is decided to use stratified random sampling with each group forming a separate
stratum. Data from previous years suggest the following variances for the number of man-hours
lost per employee in the three groups and current data give the following stratum sizes:
I (Laborers)     II (Technicians)     III (Administrators)
N1 = 132         N2 = 92              N3 = 27
Determine the Neyman allocation for a sample of n = 30 employees.
Answer: n1 = 18; n2 = 10; n3 = 2
3.
For Exercise 2, estimate the total number of man-hours lost during the given month and place a
bound on the error of estimation. Use the following data obtained from sampling 18 laborers, 10
technicians, and 2 administrators:
I (Laborers)
8
0
6
7
9
18
24
16
0
4
5
2
II (Technicians)
0
32
16
4
8
0
4
0
8
3
1
5
24
12
2
8
III (Administrators)
1
8
Answer: Total = 1903.90; Error = 676.80
4.
A zoning commission is formed to estimate the average appraised value of houses in a residential
suburb of a city. It is convenient to use the two voting districts in the suburb as strata because
separate lists of dwellings are available for each district. From the data given below, estimate the
average appraised value for all houses in the suburb and place a bound on the error of estimation
(note that proportional allocation was used):
Stratum I
Stratum II
N1 = 10
N2 = 168
n1 = 20
n2 = 30
Answer: Mean = 13208.63; Error = 560.485
5.
A corporation wishes to obtain information on the effectiveness of a business machine. A
number of division heads will be interviewed by telephone and asked to rate the equipment on a
numerical scale. The divisions are located in North America, Europe, and Asia. Hence,
stratified sampling is used. The costs are larger for interviewing division heads located outside
of North America. The following costs per interview, approximate variances of the ratings, and
Ni’s have been established:
Stratum I (North America)     Stratum II (Europe)     Stratum III (Asia)
c1 = $9                       c2 = $25                c3 = $36
N1 = 112                      N2 = 68                 N3 = 39
The corporation wants to estimate the average rating with a specified bound on the error of
estimation. Choose the sample size, n, which achieves this bound and find the appropriate allocation.
Answer: n = 27; n1 = 16; n2 = 7; n3 = 4
6.
A school desires to estimate the average score that would be obtained on a reading
comprehension exam for students in the sixth grade. The school has students divided into three
tracks, with the fast learners in track I and the slow learners in track III. It was decided to stratify
on tracks since this method should reduce variability of test scores. The sixth grade contains 55
students in track I, 80 in track II, and 65 in track III. A stratified random sample of 50 students is
proportionally allocated and yields simple random samples of n1 = 14, n2 = 20, and n3 = 16 from
tracks I, II, and III, respectively. The test is administered to the sample of students with the
following results:
Track I:   80, 68, 72, 85, 90, 62, 61, 92, 85, 87, 91, 81, 79, 83
Track II:  85, 48, 53, 65, 49, 72, 53, 68, 71, 59, 82, 75, 73, 78, 69, 81, 59, 52, 61, 42
Track III: 42, 36, 65, 43, 53, 61, 42, 39, 32, 31, 29, 19, 14, 31, 30, 32
Estimate the average score for the sixth grade, and place a bound on the error of estimation.
7.
Suppose the average test score for the class in Exercise 6 is to be estimated again at the end of
the school year. The cost of sampling are equal in all strata, but the variances differ. Find the
optimum (Neyman) allocation of a sample of size 50 using the data in Exercise 6 to approximate
the variances.
Answer: n1 = 11; n2 = 21; n3 = 18
8.
Using the data in Exercise 6, find the sample size required to estimate the average score with a
bound of 4 points on the error of estimation. Use proportional allocation.
Answer: n = 33
9.
Repeat Exercise 8 using Neyman allocation. Compare the result with the answer to Exercise 8.
Answer: n = 32
10. A forester wants to estimate the total number of farm-acres planted in trees for a state. Since the
number of acres of trees varies considerably with the size of the farm, it is decided to stratify on
farm sizes. The 240 farms in the state are placed in one of four categories according to size. A
stratified random sample of 40 farms, selected using proportional allocation, yields the following
results on number of acres planted in trees:
Stratum I (0-200 acres): N1 = 86, n1 = 14
97, 67, 42, 125, 25, 92, 105, 86, 27, 43, 45, 59, 53, 21
Stratum II (200-400 acres): N2 = 72, n2 = 12
125, 155, 67, 96, 256, 47, 310, 236, 220, 352, 142, 190
Stratum III (400-600 acres): N3 = 52, n3 = 9
142, 256, 310, 440, 495, 510, 320, 396, 196
Stratum IV (600+ acres): N4 = 30, n4 = 5
167, 655, 220, 540, 780
Estimate the total number of acres of trees on farms in the state, and place a bound on the error of
estimation.
11. The study of Exercise 10 is to be made yearly with the bound on the error of estimation 500
acres. Find an approximate sample size to achieve this bound if Neyman allocation is to be used.
Use the data in Exercise 10.
Answer: n = 156
12. A psychologist working with a group of mentally retarded adults desires to estimate their average
reaction time to a certain stimulus. He feels that men and women probably will show a
difference in reaction times so he wants to stratify on sex. The group of 96 people contains 43
men. In previous studies of this type it has been observed that the times range from 5 to 20
seconds for men and 3 to 14 seconds for women. The costs of sampling are the same for both
strata. Using optimum allocation find the approximate sample size necessary to estimate the
average reaction time for the group to within 1 second.
Answer: n = 29
13. A county government is interested in expanding the facilities of a day-care center for mentally
retarded children. The expansion would increase the cost of enrolling a child in the center. A
sample survey will be conducted to estimate the proportion of families with retarded children that
would make use of the expanded facilities. The families are divided into those who use the
existing facilities and those who do not. Some families live in the city in which the center is
located and some live in the surrounding suburban and rural areas. Thus, stratified random
sampling is used with users in the city, users in the surrounding county, nonusers in the city, and
nonusers in the county forming strata 1, 2, 3, and 4, respectively. Approximately 90% of the
present users and 50% of the present nonusers would use the expanded facilities. The cost of
obtaining an observation from a user is $4.00 and from a nonuser is $8.00. The difference in cost
is due to the fact that nonusers are difficult to locate.
Existing records give N1 = 97; N2 = 43; N3 = 145; N4 = 68. Find the approximate sample size
and allocation necessary to estimate the population proportion with a bound of 0.05 on the error
of estimation.
Answer: n = 158; n1 = 39; n2 = 17; n3 = 69; n4 = 33
14. The survey in Exercise 13 was conducted and yields the following proportion of families who
would use the new facilities:
p1 = .87, p2 = .93, p3 = .60, p4 = .53
Estimate the population proportion, P, and place a bound on the error of estimation. Was the
desired bound achieved?
Answer: Proportion = .701; Error = .0503
15. Suppose in Exercise 13 the total cost of sampling is fixed at $400. Choose the sample size and
allocation which minimizes the variance of the estimator, pst, for this fixed cost.
Answer: n = 62; n1 = 17; n2 = 6; n3 = 26; n4 = 13
16. The following data show the stratification of all the farms in a county by farm size and the
average acres of corn per farm in each stratum.
Farm size       Number of    Average      Standard         Sh^2      NhSh         NhSh^2
(acres)         farms (Nh)   corn acres   deviation (Sh)
0-40              394         5.4          8.3              68.89     3,270.20     27,142.66
41-80             461        16.3         13.3             176.89     6,131.30     81,546.29
81-120            391        24.3         15.1             228.01     5,904.10     89,151.91
121-160           334        34.5         19.8             392.04     6,613.20    130,941.26
161-              430        52.0         28.6             817.96    12,298.00    351,722.80
Total or mean   2,010        26.3         17.0             289.00    34,216.80    680,505.02

a. For a sample of 100 farms, allocate the sample size to each stratum under:
(i) Proportional allocation
(ii) Optimum allocation
(iii) Equal allocation
b. For a sample of 100 farms, compute the sampling error of the estimated total for
(i) a simple random sample
(ii) proportional allocation
(iii) Neyman allocation
c. On the basis of this analysis, which of the three methods of allocating the sample would you
recommend?
17. It is desired to estimate the total value of farm products for a population of 5,900 farms. Means
and variances are available from a past census on the value of farm products classified by farm
size and tenure of the operator:
Size and tenure       Number of    Average value   Variance       Standard         NhSh
                      farms (Nh)   of products     (Sh^2)         deviation (Sh)

All farms             5,900         3,500           97,000,000     9,848.86

Size of farm
Under 10 acres          590         1,200           18,000,000     4,242.64         2,503,158.01
10 to 49 acres        1,600         1,500           15,000,000     3,872.98         6,196,773.35
50 to 99 acres        1,150         2,200           18,000,000     4,242.64         4,879,036.79
100 to 179 acres      1,200         3,600           35,000,000     5,916.08         7,099,295.74
180 to 259 acres        490         5,500           70,000,000     8,366.60         4,099,634.13
260 to 999 acres        650         6,200          200,000,000    14,142.14         9,192,388.16
1,000+ acres            220        18,000          400,000,000    20,000.00         4,400,000.00
Total                                                                               38,370,286.18
Total of NhSh^2 = 349,620,000,000

Tenure of operator
Full owner            3,300         2,600           35,000,000     5,916.08        19,523,063.28
Part owner              660         6,900          110,000,000    10,488.09         6,922,138.40
Manager                  50        18,000          510,000,000    22,583.18         1,129,158.98
Tenant                1,890         3,500           40,000,000     6,324.56        11,953,409.56
Total                                                                               39,527,770.22
Total of NhSh^2 = 289,200,000,000
a.
Compute the standard error of the total value of products from a proportionate stratified
sample of 300 farms for each of the two methods of stratification (by size and by tenure of
the operator).
b.
Which method of stratification is more efficient for a proportionate sample?
c.
Compute the standard error of the estimate of the total value of products, using a simple
random sample of 300 farms.
d.
For both methods of stratification, use the Neyman allocation for a sample of 300 farms, and
compute:
(i) The number of sample farms in each stratum
(ii) The standard error of the estimate of the total value of products.
e.
On the basis of this analysis, which of the four methods of allocating the sample would you
recommend?
f.
Assume that the sample was stratified by tenure and allocated by the optimum method.
Assume also that the following means by strata were obtained:
Tenure         Mean Value of Products
Full owner      2,900
Part owner      6,400
Manager        20,000
Tenant          4,000
Estimate the mean value of products for the population of 5,900 farms.
g.
Describe how you would calculate the standard error of the mean computed in (f) above after
the survey results are available.
18. With three strata, the values of the Nh, Sh, and ch are as follows:
Stratum   Nh      Sh   ch
1           860   5    2
2           640   4    3
3         1,230   6    5
a.
Find the sample size in each stratum for a sample of size 200 under an optimum allocation.
b.
How much will the total field cost be?
19. Using the list of 600 households residing in 30 villages (Appendix IV), select a SRS-WOR of 20
households, and on the basis of the data on the size of these 20 sample households, do the
following :
a.
Determine the number of households for each zone and then select a sample of size nh(h = 1,
2, 3) in each of the three zones. Use proportional allocation.
b.
Estimate for each of the 3 zones separately:
(i) the total number of persons and its sampling error.
(ii) the average household (HH) size and its sampling error.
c.
Estimate for the entire population :
(i) the total number of persons and its sampling error.
(ii) the average HH size and its sampling error.
(iii) the coefficient of variations (CVs) for both number of persons and average HH size.
d.
Compare the population estimates and standard errors obtained from exercise 19 with those
obtained from SRS-WOR in chapter 5.
CHAPTER 9
RATIO ESTIMATES
_______________________________________________________________________________________________
9.1
REASONS FOR CONSIDERING USE OF RATIO ESTIMATES
In earlier chapters we dealt with the problem of how to design the most efficient sample (from the
point of view of minimizing the standard error) using as much relevant information as we can obtain
about the population. We have seen how to use information for stratification with either
proportionate sampling or optimum allocation, how to take unit costs into account, and how to
choose between different kinds of sampling units. We have seen how to use whatever knowledge we
have of costs and of the variances of different methods of sampling, in order to produce the
maximum amount of information with the resources we have available. All of this analysis has been
in terms of fairly simple estimates such as $\bar{y}$ and $\hat{Y}$, in which the estimates were prepared by
using only the sample data, the total number of units (N) in the population and the probabilities of
selection. Thus, for simple random sampling,
$$\hat{Y} = N\bar{y} = \frac{N}{n}\sum_{i=1}^{n} y_i$$
For stratified sampling,
$$\hat{Y} = \sum_{h} N_h\,\bar{y}_h$$
There are similar formulas for cluster sampling, or for estimation of proportions.
There are, however, more complex methods of estimating these statistics, which under certain
circumstances can result in very large reductions in the standard errors.
Moreover, there are other types of statistics which we wish to measure--such as ratios of two
characteristics, change over time of a single characteristic, etc. For example, we may obtain
information on wages and salary payments and on number of hours worked, but we may be more
interested in estimating the average hourly earnings, rather than total wages and salaries or total
hours worked. From surveys covering two different periods of time, we may be more interested in
finding out whether total wages have gone up or down than in measuring the level at any one time.
The analysis of the standard errors of estimated ratios also helps with the problem of producing more
efficient estimates of means and totals.
We shall investigate the simplest and most commonly used method of improving the reliability of an
estimated mean or total, by the use of a special estimating technique which produces a "ratio
estimate." A number of other very powerful tools are useful in particular situations; for example,
difference estimates and regression estimates, double sampling (in which the final sample is selected
from a previously selected larger sample that provides information for improving the final selection
or the estimation procedure), and special methods for the estimation of time series. However, this
chapter discusses only ratio estimates.
9.2
RATIO ESTIMATES OF AGGREGATES
Ratio estimation is the most commonly used of the more complex estimation techniques available to
the statistician. It is also the easiest to apply. It is appropriate whenever the units of the population
possess two characteristics that are positively correlated--the higher the correlation, the greater the
gain from using this technique. The simplest kind of ratio estimator, of the form given by equation
(9.1), is an estimate of Y (the population aggregate):¹
$$\hat{Y}_R = \frac{\hat{Y}}{\hat{X}}\,X \qquad (9.1)$$
Here, $\hat{Y}$ and $\hat{X}$ are the ordinary estimates of the aggregates of two characteristics Y and X; the
aggregate X must be known in order to estimate the aggregate Y.
To compute $\hat{Y}_R$ it is not necessary to compute $\hat{Y}$ and $\hat{X}$ since, for a self-weighting sample,
$$\frac{\hat{Y}}{\hat{X}} = \frac{\sum y_i}{\sum x_i}$$
However, the formula in (9.1) is useful in deriving the variance of $\hat{Y}_R$.
Ratio estimates of aggregates are ordinarily applied in the three situations described in sections 9.2.1
to 9.2.3 below.
9.2.1
Ratio to Same or Related Characteristic at an Earlier Time Period
X is the same type of characteristic as Y, but X refers to an earlier time period during which a
complete census was taken. For example, we may have taken a full census of manufacturers in one
year, and wish to take a sample survey the following year. Suppose we wish to estimate the total
value of shipments. For each manufacturing establishment in the sample, we obtain not only yi, the
value of shipments in the survey year, but also xi, the value during the preceding census year. Then
$\hat{Y}$ and $\hat{X}$ would be estimates from the sample of total shipments for the two years, obtained by the
methods discussed earlier. X is the total value of shipments tabulated from the full census. In this
application, the survey is actually used to measure the rate of change between the two years, using
the identical sample of establishments. The rate of change is then multiplied by the census total for
the previous year.
1. The ratio estimator of a mean (as an estimate of $\bar{Y}$) is obtained by dividing $\hat{Y}_R$ by N; it has the same
coefficient of variation as the estimate $\hat{Y}_R$.
9.2.2
Ratio of Two Related Characteristics at the Same Time Period
Y and X are two different characteristics for the same time period, which are known to be positively
correlated. The true value of the aggregate X is known. For example, for the ith farm in a sample, xi
may be the total hectares in the farms, and yi the payments for farm labor; the total hectares in all
farms, X, is known from another source. If, in general, the larger farms pay more total wages for
farm labor than the smaller ones, the ratio estimate can drastically reduce the sampling error. In this
application, the survey is used to measure a rate (such as the average payment per hectare) which is
multiplied by the known number of hectares.
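The calculation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample is taken to be self-weighting, the function name `ratio_estimate_total` is hypothetical, and the farm figures are invented for the example rather than taken from any survey.

```python
def ratio_estimate_total(y, x, X_total):
    """Ratio estimate of the aggregate Y for a self-weighting sample.

    The sample rate (sum of y / sum of x) is multiplied by the known
    population aggregate X_total of the auxiliary characteristic.
    """
    if len(y) != len(x) or len(y) == 0:
        raise ValueError("y and x must be non-empty and of equal length")
    rate = sum(y) / sum(x)          # e.g. average payment per hectare
    return rate * X_total


# Hypothetical sample of four farms (values invented for illustration).
hectares = [5.0, 8.0, 10.0, 17.0]            # x_i
payments = [400.0, 600.0, 800.0, 1400.0]     # y_i
X_known = 800.0                              # total hectares, known from another source

estimate = ratio_estimate_total(payments, hectares, X_known)
print(estimate)
```

Only the two sample sums and the known total X enter the estimate, so the separate inflated estimates of Y and X never need to be formed.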
9.2.3
Ratio of a Subset to the Total
The characteristic Y is a subset of X, varying roughly in proportion to X. For example, xi may be total
acres in the ith farm in the sample, and yi the acres planted to a particular crop on that farm. Another
application is the case in which X is the total number of units of analysis and Y is the number of these
having a particular attribute. For example, yi might be the number of persons in the labor force in the
ith cluster; xi is the total number² of persons in this cluster; and X is the known total number of
persons in the population. In these cases, the survey is used to measure a ratio $\hat{Y}/\hat{X}$ which is then
multiplied by the population total (X) for the characteristic in the denominator of the ratio.
9.3
VARIANCE AND BIAS OF A RATIO ESTIMATE
In examining $\hat{Y}_R = (\hat{Y}/\hat{X})\,X$, it is clear that X is not derived from the sample. The sampling error in the
estimate $\hat{Y}_R$ is, therefore, dependent on the sampling error of the ratio, $\hat{Y}/\hat{X}$, with X
having only the effect of a constant multiplier. Therefore, an analysis of the sampling error of $\hat{Y}_R$
is closely related to that of the ratio $\hat{R} = \hat{Y}/\hat{X}$ as an estimate of R = Y/X.
2. In cluster sampling, the estimate of the total number of units of analysis will be a random variable, which is usually not exactly equal to the
true figure. Hence, the proportion of units having the attribute must be treated as a ratio of random variables.
The mathematical form of the distribution of the ratio of two random variables from sample to
sample is much more complicated than that of the simpler estimates discussed earlier. It involves the
relationship of two variables, both of which have sampling errors. Hence, more care is required in
deciding when to use such ratios. The following facts about the variance of ratios and ratio estimates
will indicate when to use a ratio estimator to estimate a mean or an aggregate. They also tell us what
error to expect when using the estimate.
9.3.1
Variance of Ratios and Ratio Estimates
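The variance of an estimated ratio $\hat{R} = \hat{Y}/\hat{X}$ is approximately
$$\sigma^2_{\hat{R}} \approx R^2\left(\frac{\sigma^2_{\hat{Y}}}{Y^2} + \frac{\sigma^2_{\hat{X}}}{X^2} - \frac{2\rho\,\sigma_{\hat{Y}}\sigma_{\hat{X}}}{YX}\right) \qquad (9.2)$$
where R is the population ratio Y/X (a ratio of aggregates), and $\sigma^2_{\hat{Y}}$ and $\sigma^2_{\hat{X}}$ are the variances of $\hat{Y}$ and $\hat{X}$.
Similarly, the variance of the ratio estimate of a total, $\hat{Y}_R = \hat{R}X$, is
$$\sigma^2_{\hat{Y}_R} = X^2\sigma^2_{\hat{R}} \approx Y^2\left(\frac{\sigma^2_{\hat{Y}}}{Y^2} + \frac{\sigma^2_{\hat{X}}}{X^2} - \frac{2\rho\,\sigma_{\hat{Y}}\sigma_{\hat{X}}}{YX}\right) \qquad (9.3)$$
The alternative form of this equation is
$$\sigma^2_{\hat{Y}_R} \approx \sigma^2_{\hat{Y}} + R^2\sigma^2_{\hat{X}} - 2R\rho\,\sigma_{\hat{Y}}\sigma_{\hat{X}} \qquad (9.3a)$$
and is estimated by
$$s^2_{\hat{Y}_R} = s^2_{\hat{Y}} + \hat{R}^2 s^2_{\hat{X}} - 2\hat{R}\,s_{\hat{Y}\hat{X}} \qquad (9.3b)$$
where $s^2_{\hat{Y}}$, $s^2_{\hat{X}}$ and $s_{\hat{Y}\hat{X}}$ are the sample estimates of the variances and of the sampling covariance.
Equations (9.2) and (9.3) are somewhat simpler if expressed in terms of the coefficient of variation,
CV. The square of the coefficient of variation (that is, the rel-variance) of $\hat{R}$ is the same as that
of $\hat{Y}_R$ and can be expressed as
$$CV^2_{\hat{Y}_R} = CV^2_{\hat{R}} \approx CV^2_{\hat{Y}} + CV^2_{\hat{X}} - 2\rho\,CV_{\hat{Y}}CV_{\hat{X}} \qquad (9.4)$$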
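In the above formulas, ρ is the coefficient of correlation between the variables Y and X. It represents
the correlation of Y and X, not for the elementary units of analysis but for the units used for
sampling. For example, if Y and X represent the incomes of persons in two different years, but the
sample is a cluster sample, the correlation coefficient ρ will be the correlation between the values Yi
and Xi where Yi is the sum of the incomes for all persons in the ith cluster in the year of estimation
and Xi is the corresponding sum in the base year. Frequently, $\rho\,\sigma_{\hat{Y}}\sigma_{\hat{X}}$ is referred to as the sampling
covariance between $\hat{Y}$ and $\hat{X}$, and the symbol $\sigma_{\hat{Y}\hat{X}}$ is used for it. It can be calculated exactly as the
variance, but with the cross product $(Y_i-\bar{Y})(X_i-\bar{X})$ replacing the square $(Y_i-\bar{Y})^2$ wherever it
occurs. Thus, for simple random sampling we have
$$\sigma_{\hat{Y}\hat{X}} = \rho\,\sigma_{\hat{Y}}\sigma_{\hat{X}} = N^2\,\frac{1-f}{n}\,S_{YX} \qquad (9.5)$$
where f = n/N is the sampling fraction and
$$S_{YX} = \frac{\sum_{i=1}^{N}(Y_i-\bar{Y})(X_i-\bar{X})}{N-1} \qquad (9.6)$$
and an estimate of SYX can be made from the sample by using
$$s_{yx} = \frac{\sum_{i=1}^{n}(y_i-\bar{y})(x_i-\bar{x})}{n-1} \qquad (9.7)$$
The corresponding estimate of ρ, designated by ρ′, is obtained by putting sample values of the covariance
and the standard errors in place of the population values, in equation (9.5), and solving for ρ, which then
becomes ρ′. The ρ′ may also be computed directly from
$$\rho' = \frac{\sum(y_i-\bar{y})(x_i-\bar{x})}{\sqrt{\sum(y_i-\bar{y})^2\,\sum(x_i-\bar{x})^2}} \qquad (9.8)$$
For a stratified sample, with estimates of totals given by $\hat{Y} = \sum_h N_h\bar{y}_h$ and $\hat{X} = \sum_h N_h\bar{x}_h$,
$$\sigma_{\hat{Y}\hat{X}} = \sum_h N_h^2\,\frac{1-f_h}{n_h}\,S_{hYX} \qquad (9.9)$$
where $S_{hYX}$ are within-strata covariances and are computed in exactly the same way, but are
restricted to the values within each stratum.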
9.3.2
Gains with a Ratio Estimate
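If we examine equation (9.4), the formula for the rel-variance of an estimate of a total,
$$CV^2_{\hat{Y}_R} \approx CV^2_{\hat{Y}} + CV^2_{\hat{X}} - 2\rho\,CV_{\hat{Y}}CV_{\hat{X}} \qquad (9.4)$$
we see that CV² of the ratio estimate $\hat{Y}_R$ can be expressed as CV² of the simpler estimate $\hat{Y}$
plus the term $CV^2_{\hat{X}}$ minus the term $2\rho\,CV_{\hat{Y}}CV_{\hat{X}}$. Whether we gain or lose by the
use of a ratio estimate, as compared with the simpler estimate $\hat{Y}$, depends on whether
$CV^2_{\hat{X}} - 2\rho\,CV_{\hat{Y}}CV_{\hat{X}}$ is smaller or greater than zero. Another way of expressing this is the
following:
(1) If $\rho > \tfrac{1}{2}\,CV_{\hat{X}}/CV_{\hat{Y}}$, a ratio estimate is more efficient.
(2) If $\rho < \tfrac{1}{2}\,CV_{\hat{X}}/CV_{\hat{Y}}$, a ratio estimate is less efficient.
(3) If $\rho = \tfrac{1}{2}\,CV_{\hat{X}}/CV_{\hat{Y}}$, both estimates have the same standard error.
9.3.2.1
High Correlation.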
To see the implication of these facts in some common situations, consider the example of a census of
manufacturers which was conducted in one year, followed by a sample the next year. Let yi and xi
represent the values of shipments for the same sample firm in two consecutive years. In this
case $CV_{\hat{Y}}$ and $CV_{\hat{X}}$ are nearly the same, and $CV_{\hat{X}}/CV_{\hat{Y}}$ is approximately 1.
Furthermore, there will be a very high correlation between Y and X, probably about 0.90 or 0.95.
Consequently, a ratio estimate will result in a substantial gain in accuracy. The amount of the gain
can be found as follows: if $CV_{\hat{X}} = CV_{\hat{Y}}$, equation (9.4) becomes
$$CV^2_{\hat{Y}_R} \approx 2\,CV^2_{\hat{Y}}\,(1-\rho)$$
and if ρ = .90, we have
$$CV^2_{\hat{Y}_R} \approx 0.2\,CV^2_{\hat{Y}}$$
In other words, the use of a ratio estimate achieves an 80 percent reduction in variance.
If ρ = .95, $CV^2_{\hat{Y}_R}$ becomes equal to $0.1\,CV^2_{\hat{Y}}$ and the reduction is 90 percent. Looking
at the result in another way, the ratio estimate is as effective as using a sample 5 times (or 10 times)
as large.
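The arithmetic behind these percentages can be checked with a short Python sketch. It assumes the usual first-order form of the rel-variance of a ratio estimate, CV² of the simple estimate plus CV² of the estimate of X minus 2ρ times the product of the two CVs; the function names and the common CV of 5 percent are illustrative assumptions only.

```python
def relvar_ratio_estimate(cv_y, cv_x, rho):
    """Approximate rel-variance (CV squared) of a ratio estimate of a total."""
    return cv_y ** 2 + cv_x ** 2 - 2.0 * rho * cv_y * cv_x


def variance_reduction(cv_y, cv_x, rho):
    """Fraction by which the ratio estimate reduces the variance of the
    simple estimate (negative values mean the ratio estimate is worse)."""
    return 1.0 - relvar_ratio_estimate(cv_y, cv_x, rho) / cv_y ** 2


# Equal CVs for the two years, as in the shipments example; 0.05 is arbitrary.
for rho in (0.90, 0.95):
    reduction = variance_reduction(0.05, 0.05, rho)
    print(rho, round(reduction, 2), round(1.0 / (1.0 - reduction), 1))
```

The last printed figure is the factor by which a simple estimate's sample would have to grow to match the ratio estimate, ignoring the finite population correction.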
9.3.2.2
Low Correlation.
Consider now the situation described in section 9.2.3 in which Y is a subset of X. In such cases, the
correlation is likely to be quite low, unless Y/X is fairly large--for example, greater than ½. In
practice, if Y/X is less than about 20 percent, a ratio estimate may increase the sampling error
although, generally, not much. If Y/X is greater than 40 or 50 percent, a ratio estimate will usually
improve the efficiency; the closer to 100 percent, the more the improvement. Between 20 and 40
percent, the differences between the two types of estimates will be small. Thus, for example, in a
labor force survey, the use of ratio estimates probably provides an important improvement in the
estimate of the number of employed (which comprises a fairly high proportion of the adult
population) but probably results in a slight increase in the standard error of the estimate of
unemployed.
9.3.3
Bias of the Ratio Estimate
The ratio estimate is a biased estimate. This can easily be demonstrated by constructing a small
population with values Yi and Xi for each element, taking all possible samples of two or three
elements, and computing the ratio $\hat{Y}/\hat{X}$
for each sample. It will be seen that the average of the ratios is not
the true average. However, the bias tends to be negligible for moderately large samples. In most
practical applications, the bias is so small compared with the advantage gained in reducing the
sampling error, that the ratio estimate is preferred over the unbiased estimate.
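The demonstration described above can be carried out mechanically. The following Python sketch enumerates every possible sample of a given size from a tiny population and averages the sample ratios. The four pairs of values are invented for illustration, and the helper name `mean_of_sample_ratios` is hypothetical.

```python
from itertools import combinations

# Hypothetical population of N = 4 elements (values chosen only for illustration).
Y = [2.0, 3.0, 9.0, 4.0]
X = [1.0, 4.0, 6.0, 5.0]
R_true = sum(Y) / sum(X)   # population ratio of aggregates


def mean_of_sample_ratios(Y, X, n):
    """Average of (sum of y / sum of x) over all possible samples of size n,
    drawn without replacement with equal probability."""
    ratios = []
    for idx in combinations(range(len(Y)), n):
        ratios.append(sum(Y[i] for i in idx) / sum(X[i] for i in idx))
    return sum(ratios) / len(ratios)


for n in (2, 3, 4):
    avg = mean_of_sample_ratios(Y, X, n)
    print(n, round(avg, 4), round(avg - R_true, 4))   # sample size, average ratio, bias
```

For samples of two or three elements the average of the sample ratios differs from the population ratio, which is the bias; when the sample is the whole population the difference vanishes.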
9.3.4
Consistent Estimates
A ratio estimate, although biased, is a consistent estimate. This means that, if we use a large enough
sample, we can be sure that the estimate will be as close as we like to the true value. Not only does
the standard error decrease with increasing sample size, but the bias is also reduced.
9.3.5
Confidence Limits
For reasonably large samples, ratio estimates are normally distributed (for the kinds of populations
dealt with in practice). Consequently, if we can compute the standard error of the ratio estimate, we
can construct the same type of confidence limits for $\hat{R}$ and $\hat{Y}_R$ as for $\bar{y}$ and $\hat{Y}$; that is, we can say that
we have a 68-percent chance that a range around the estimate of plus and minus one standard error
will cover the true figure, a 95-percent chance that a range of plus and minus two standard errors will
cover the true figure, etc.
9.3.6
Minimum Sample Size Required
Sections 9.3.3 to 9.3.5 above refer to the fact that moderately large samples are needed to make the
bias negligible, and to provide a reasonably normal distribution of sample estimates. When is the
sample large enough? The following working rule has been suggested: If the sample size exceeds
30 and if the coefficients of variation of $\hat{X}$ and $\hat{Y}$ are both less than 10 percent, then the bias is
negligible and we can assume that the theory for the normal distribution applies. The first condition
does not mean that a ratio estimate is necessarily better than a simple unbiased estimate whenever
n > 30; it means this size of sample is required before the formulas for sampling error have the usual
meaning in terms of confidence intervals.
9.3.7
Formula for Bias
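An approximation to the bias of an estimate of a ratio of two variables, $\hat{R} = \hat{Y}/\hat{X}$, is
$$\text{Bias}(\hat{R}) \approx R\left(CV^2_{\hat{X}} - \rho\,CV_{\hat{Y}}CV_{\hat{X}}\right)$$
where ρ and R are defined as in section 9.3.1. For the estimate of a total, $\hat{Y}_R$, the bias is
$$\text{Bias}(\hat{Y}_R) \approx Y\left(CV^2_{\hat{X}} - \rho\,CV_{\hat{Y}}CV_{\hat{X}}\right)$$
Even with low values of ρ this will be small compared with the standard error of $\hat{Y}_R$, provided only
that the sample is reasonably large so that $CV_{\hat{X}}$ is small.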
These bias formulas are presented for analytical purposes. They are never used to adjust estimates.
In situations where the bias would be expected to be significantly large, we would either increase the
sample size or use a different method of estimation.
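A short Python sketch can make the size of the bias relative to the standard error concrete. It assumes the standard first-order approximations: relative bias equal to CV² of the estimate of X minus ρ times the product of the two CVs, and rel-variance equal to the sum of the two squared CVs minus 2ρ times their product. The function name and the input values are illustrative assumptions.

```python
from math import sqrt


def bias_to_standard_error(cv_y, cv_x, rho):
    """Approximate |bias| / standard error for a ratio estimate.

    cv_y, cv_x : coefficients of variation of the simple estimates of Y and X
    rho        : correlation between the two estimates
    """
    rel_bias = cv_x ** 2 - rho * cv_y * cv_x
    rel_var = cv_y ** 2 + cv_x ** 2 - 2.0 * rho * cv_y * cv_x
    return abs(rel_bias) / sqrt(rel_var)


# Illustrative values only: a low correlation and CVs of 8 and 6 percent.
ratio = bias_to_standard_error(0.08, 0.06, 0.3)
print(round(ratio, 4))
```

Under these approximations the ratio of bias to standard error cannot exceed the CV of the estimate of X, because the rel-variance equals (CV of X minus ρ times CV of Y) squared plus a non-negative term. A small CV for the estimate of X therefore guarantees a bias that is small relative to the standard error.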
9.3.8
Danger in Use of Ratio Estimate
If ratio estimates are applied separately for a large number of subgroups of the population, with a
small sample in each subgroup, the biases in the subgroups may accumulate and become too large to
ignore. For example, suppose a relatively small sample of persons is classified by separate age-sex
groups--300 persons divided into 5-year age groups by sex. There would be about 30 such groups.
Suppose we know the true total population in each of these 30 groups. For any statistic we are
interested in, we could compute a separate ratio estimate for the persons in each of the 30 groups,
and then get a final estimate by adding the 30 results. The average size of sample in each group
would be 10. Since there would be only a small sample in each of the age groups for which a ratio
estimate would be formed, the accumulation of 30 different ratio estimates could result in a serious
bias. In such a case, the use of ratio estimation group-by-group is not recommended.
9.3.9
Illustration
Suppose that a complete census of the value of manufacturing shipments was taken in 1981. The
following table shows the value of shipments in 1981 and in 1982 for each unit in a simple random
sample of 10 units drawn from a population of 30. The problem is to estimate the total value of
shipments in 1982. The true 1981 total, X, is assumed to be known. Its value is $19.5 billion.
Value of shipments in    Value of shipments in
1981 (xi)                1982 (yi)
0.3                      0.1
1.1                      0.6
0.5                      0.8
0.4                      0.6
1.0                      1.0
0.7                      0.8
0.2                      0.9
0.3                      0.8
2.4                      2.7
0.1                      0.2
We have,
N = 30, n = 10
Compute the estimate of the total and the variance, the coefficient of variation of the estimate and
the confidence interval for Y by using (a) a method of simple random sampling and (b) a method of
ratio estimates.
(a) Simple random sampling
(1)
(2)
(3)
(4)
(5) 95 % confidence interval for Y is
(b) Ratio Estimates
(1)
=
Using equation (9.3b),
(2)
=
=
(3)
(4)
(5) A 95 % confidence interval for Y is
($16.46, $30.90) billion
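The two sets of calculations can be reproduced from the table with the Python sketch below. It is a hedged reconstruction rather than the manual's own working: it assumes simple random sampling without replacement with the finite population correction, the residual form of the ratio-estimate variance based on the deviations yi - r·xi, and limits of plus and minus two standard errors. Intermediate figures may therefore differ from those printed in the original because of rounding or a different convention.

```python
from math import sqrt

# Sample values from the table (billions of dollars).
x = [0.3, 1.1, 0.5, 0.4, 1.0, 0.7, 0.2, 0.3, 2.4, 0.1]   # 1981
y = [0.1, 0.6, 0.8, 0.6, 1.0, 0.8, 0.9, 0.8, 2.7, 0.2]   # 1982
N, n, X_total = 30, 10, 19.5
f = n / N                      # sampling fraction
ybar = sum(y) / n

# (a) Simple random sampling: inflate the sample mean by N.
Y_srs = N * ybar
s2_y = sum((yi - ybar) ** 2 for yi in y) / (n - 1)
se_srs = sqrt(N ** 2 * (1 - f) * s2_y / n)
cv_srs = se_srs / Y_srs
ci_srs = (Y_srs - 2 * se_srs, Y_srs + 2 * se_srs)

# (b) Ratio estimate: sample ratio times the known 1981 total.
r = sum(y) / sum(x)
Y_ratio = r * X_total
s2_d = sum((yi - r * xi) ** 2 for xi, yi in zip(x, y)) / (n - 1)
se_ratio = sqrt(N ** 2 * (1 - f) * s2_d / n)
cv_ratio = se_ratio / Y_ratio
ci_ratio = (Y_ratio - 2 * se_ratio, Y_ratio + 2 * se_ratio)

print("SRS:  ", round(Y_srs, 2), round(se_srs, 2), round(cv_srs, 3), ci_srs)
print("Ratio:", round(Y_ratio, 2), round(se_ratio, 2), round(cv_ratio, 3), ci_ratio)
```

The ratio estimate of the total comes to about $23.68 billion, the midpoint of the interval printed above, and its standard error is smaller than that of the simple estimate.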
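Formulas for Ratio Estimation Variances
Population Ratio R:
$$R = \frac{T_y}{T_x} = \frac{\mu_y}{\mu_x}, \qquad \text{estimated by } r = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} x_i}$$
Estimated Variance of r:
$$\hat{V}(r) = \left(\frac{N-n}{nN}\right)\frac{1}{\mu_x^2}\,\frac{\sum_{i=1}^{n}(y_i - r x_i)^2}{n-1}$$
(if $\mu_x$ is unknown, the sample mean $\bar{x}$ is used in its place)
Ratio Estimator of the Population Total Ty:
$$\hat{T}_y = r\,T_x$$
Estimated Variance of $\hat{T}_y$:
$$\hat{V}(\hat{T}_y) = T_x^2\,\hat{V}(r) = N^2\left(\frac{N-n}{nN}\right)\frac{\sum_{i=1}^{n}(y_i - r x_i)^2}{n-1}$$
Ratio Estimator of a Population Mean μy:
$$\hat{\mu}_y = r\,\mu_x$$
Estimated Variance of $\hat{\mu}_y$:
$$\hat{V}(\hat{\mu}_y) = \mu_x^2\,\hat{V}(r) = \left(\frac{N-n}{nN}\right)\frac{\sum_{i=1}^{n}(y_i - r x_i)^2}{n-1}$$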
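As a check on how such formulas are applied, the Python sketch below works from the column sums that accompany the exercises, so the individual observations are not needed. It assumes the residual form of the variance (sum of squared deviations yi - r·xi divided by n - 1), a bound on the error of two standard errors, and the sample mean of x in place of the population mean when the latter is not given. The function name `ratio_summary` is hypothetical.

```python
from math import sqrt


def ratio_summary(n, N, sum_x, sum_y, sum_x2, sum_y2, sum_xy):
    """Return (r, s2) where r = sum_y / sum_x and s2 is
    sum((y - r*x)^2) / (n - 1), computed from the column sums."""
    r = sum_y / sum_x
    ss = sum_y2 - 2.0 * r * sum_xy + r * r * sum_x2
    return r, ss / (n - 1)


# Exercise 3: ratio of food spending to income, n = 14 of N = 150 households.
n, N = 14, 150
r3, s2_3 = ratio_summary(n, N, 145850, 30816, 1653577700, 74376066, 348648860)
xbar = 145850 / n                                  # stands in for the unknown mean of x
bound_r3 = 2 * sqrt((N - n) / (n * N) * s2_3 / xbar ** 2)
print(round(r3, 4), round(bound_r3, 4))

# Exercise 4: total earnings, n = 13 of N = 123 offices, known Tx = 128,200.
n, N, Tx = 13, 123, 128200
r4, s2_4 = ratio_summary(n, N, 13547, 15422, 15963525, 21073754, 18273841)
total4 = r4 * Tx
bound_total4 = 2 * sqrt(N ** 2 * (N - n) / (n * N) * s2_4)
print(round(total4, 4), round(bound_total4, 2))
```

The results agree with the answers quoted for Exercise 3, and dividing the Exercise 4 total and bound by N = 123 gives the mean and bound quoted for Exercise 5.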
Ratio Estimation
1. A forester is interested in estimating the total volume of trees in a timber sale. He records the
volume for each tree in a simple random sample. In addition he measures the basal area for each
tree marked for sale. He then uses a ratio estimator of total volume.
The forester decides to take a simple random sample of n = 12 from the N = 250 trees marked for
sale. Let x denote basal area and y the cubic foot volume for a tree. The total basal area for all
250 trees, Tx, is 75 square feet. Use the data below to estimate Ty, the total cubic foot volume
for those trees marked for sale, and place a bound on the error of estimation.
Tree       Basal area   Cubic foot
sampled    (x)          volume (y)    x²      y²       (xy)²
1          .3           6             0.09    36       3.24
2          .5           9             0.25    81       20.25
3          .4           7             0.16    49       7.84
4          .9           19            0.81    361      292.41
5          .7           15            0.49    225      110.25
6          .2           5             0.04    25       1
7          .6           12            0.36    144      51.84
8          .5           9             0.25    81       20.25
9          .8           20            0.64    400      256
10         .4           9             0.16    81       12.96
11         .8           18            0.64    324      207.36
12         .6           13            0.36    169      60.84
Σ          6.7          142           4.25    1,976    1,044.24
2. Use the y-data in Exercise 1 to compute an estimate of Ty using $N\bar{y}$. Place a bound on the error
of estimation. Compare your results to those obtained in Exercise 1.
3. A consumer survey was conducted to determine the ratio of the money spent on food to the total
income per year for households in a small community. A simple random sample of 14
households was selected from 150 in the community. Sample data are tabulated below. Estimate
R, the population ratio, and place a bound on the error of estimation.
Household   Total Income   Amount spent on
            (xi)           food (yi)         xi²             yi²          xi·yi
1           5,010          990               25,100,100      980,100      4,959,900
2           12,240         2,524             149,817,600     6,370,576    30,893,760
3           9,600          1,935             92,160,000      3,744,225    18,576,000
4           15,600         3,123             243,360,000     9,753,129    48,718,800
5           14,400         2,760             207,360,000     7,617,600    39,744,000
6           6,500          1,337             42,250,000      1,787,569    8,690,500
7           8,700          1,756             75,690,000      3,083,536    15,277,200
8           8,200          2,132             67,240,000      4,545,424    17,482,400
9           14,600         3,504             213,160,000     12,278,016   51,158,400
10          12,700         2,286             161,290,000     5,225,796    29,032,200
11          11,500         2,875             132,250,000     8,265,625    33,062,500
12          10,600         2,226             112,360,000     4,955,076    23,595,600
13          7,700          1,463             59,290,000      2,140,369    11,265,100
14          8,500          1,905             72,250,000      3,629,025    16,192,500
Σ           145,850        30,816            1,653,577,700   74,376,066   348,648,860
Answer: Ratio = .2113; Error = .0126
4. A corporation is interested in estimating the total earnings from sales of color television sets at
the end of a given three month period. The total earnings figures are available for all districts
within the corporation for the corresponding three month period of the previous year. A simple
random sample of 13 districts offices is selected from the 123 offices within the corporation.
Using a ratio estimator, estimate Ty and place a bound on the error of estimation. Use the data in
the table below and take Tx = 128,200.
Office   Three month data     Three month data
         from previous year   from current year
         (xi)                 (yi)                xi²          yi²          xi·yi
1        550                  610                 302,500      372,100      335,500
2        720                  780                 518,400      608,400      561,600
3        1,500                1,600               2,250,000    2,560,000    2,400,000
4        1,020                1,030               1,040,400    1,060,900    1,050,600
5        620                  600                 384,400      360,000      372,000
6        980                  1,050               960,400      1,102,500    1,029,000
7        928                  977                 861,184      954,529      906,656
8        1,200                1,440               1,440,000    2,073,600    1,728,000
9        1,350                1,570               1,822,500    2,464,900    2,119,500
10       1,750                2,210               3,062,500    4,884,100    3,867,500
11       670                  980                 448,900      960,400      656,600
12       729                  865                 531,441      748,225      630,585
13       1,530                1,710               2,340,900    2,924,100    2,616,300
Σ        13,547               15,422              15,963,525   21,073,754   18,273,841
Answer: Total = 145,943.7809; Error = 7,353.67
5.
Use the data in Exercise 4 to estimate the mean earnings for offices within the corporation.
Place a bound on the error of estimation.
Answer: Mean = 1,186.5348; Error = 59.79
6.
An investigator has a colony of N = 763 rats which have been subjected to a standard drug.
The average length of time to thread a maze correctly under influence of the standard drug
was found to be μx = 17.2 seconds. The investigator now would like to subject a random
sample of 11 rats to a new drug. Estimate the average time required to thread the maze while
under the influence of the new drug. Place a bound on the error of estimation. (Hint: it is
reasonable to employ a ratio estimator for μy if we assume that the rats will react to the new
drug in much the same way as they did the standard drug.)
Rat    Standard Drug   New Drug
       (xi)            (yi)        xi²        yi²        xi·yi
1      14.3            15.2        204.49     231.04     217.36
2      15.7            16.1        246.49     259.21     252.77
3      17.8            18.1        316.84     327.61     322.18
4      17.5            17.6        306.25     309.76     308
5      13.2            14.5        174.24     210.25     191.4
6      18.8            19.4        353.44     376.36     364.72
7      17.6            17.5        309.76     306.25     308
8      14.3            14.1        204.49     198.81     201.63
9      14.9            15.2        222.01     231.04     226.48
10     17.9            18.1        320.41     327.61     323.99
11     19.2            19.5        368.64     380.25     374.4
Σ      181.2           185.3       3,027.06   3,158.19   3,090.93
Answer: Mean = 17.5892; Error = .2710
7.
A group of 100 rabbits is being used in a nutrition study. A pre-study weight is recorded for
each rabbit. The average of these weights is 3.1 pounds. After two months the experimenter
wants to obtain a rough approximation of the average weight of the rabbits. He selects n = 10
rabbits at random and weighs them. The original weights and current weights are presented
below:
Rabbit   Original weight   Current weight
1        3.2               4.1
2        3.0               4.0
3        2.9               4.1
4        2.8               3.9
5        2.8               3.7
6        3.1               4.1
7        3.0               4.2
8        3.2               4.1
9        2.9               3.9
10       2.8               3.8
Estimate the average current weight and place a bound on the error of estimation.
8.
A social worker wants to estimate the ratio of the average number of rooms per apartment to
the average number of people per apartment in an urban ghetto area. He selects a simple
random sample of 25 apartments from the 275 in the ghetto area. Let xi denote the number of
people in apartment i, and let yi denote the number of rooms in apartment i. From a count of
the number of rooms and number of people in each apartment, the following data are
obtained:
Estimate the ratio of average number of rooms to average number of people for this area, and
place a bound on the error of estimation.
Answer: Ratio = .283; Error = .0616
9.
A forest resource manager is interested in estimating the number of dead fir trees in a 300
acre area of heavy infestation. Using an aerial photo, he divides the area into 200 one and a
half acre plots. Let x denote the photo count of dead firs and y the actual ground count for a
simple random sample of n = 10 plots. The total number of dead fir trees obtained from the
photo count is Tx = 4,200. Use the sample data below to estimate Ty, the total number of
dead firs in the 300 acre area. Place a bound on the error of estimation.
Plot sampled   Photo count   Ground count
               (xi)          (yi)           xi²     yi²      xi·yi
1              12            18             144     324      216
2              30            42             900     1,764    1,260
3              24            24             576     576      576
4              24            36             576     1,296    864
5              18            24             324     576      432
6              30            36             900     1,296    1,080
7              12            14             144     196      168
8              6             10             36      100      60
9              36            48             1,296   2,304    1,728
10             42            54             1,764   2,916    2,268
Σ              234           306            6,660   11,348   8,652
Answer: Total = 5,492.3077; Error = 428.4381
10.
Members of a teachers’ association are concerned about the salary increases given to high
school teachers in a particular school system. A simple random sample of n = 15 teachers is
selected from an alphabetical listing of all high school teachers in the system. All 15 teachers
are interviewed to determine their salaries for this year and the previous year. Use these data
to estimate R, the rate of change, for N = 750 high school teachers in the community school
system. Place a bound on the error of estimation.
Teacher   Past year's   Present year's
          salary (xi)   salary (yi)      xi²           yi²           xi·yi
1         5,400         5,600            29,160,000    31,360,000    30,240,000
2         6,700         6,940            44,890,000    48,163,600    46,498,000
3         7,792         8,084            60,715,264    65,351,056    62,990,528
4         9,956         10,275           99,121,936    105,575,625   102,297,900
5         6,355         6,596            40,386,025    43,507,216    41,917,580
6         5,108         5,322            26,091,664    28,323,684    27,184,776
7         7,891         8,167            62,267,881    66,699,889    64,445,797
8         5,216         5,425            27,206,656    29,430,625    28,296,800
9         5,416         5,622            29,333,056    31,606,884    30,448,752
10        5,397         5,597            29,127,609    31,326,409    30,207,009
11        8,152         8,437            66,455,104    71,182,969    68,778,424
12        6,436         6,700            41,422,096    44,890,000    43,121,200
13        9,192         9,523            84,492,864    90,687,529    87,535,416
14        7,006         7,279            49,084,036    52,983,841    50,996,674
15        7,311         7,582            53,450,721    57,486,724    55,432,002
Σ         103,328       107,149          743,204,912   798,576,051   770,390,858
Answer: Ratio = 1.037; Error = 0.001391
11.
An experimenter was investigating a new food additive for cattle. Midway through the two
month study, he was interested in estimating the average weight for the entire herd of N =
500 steers. A simple random sample of n = 12 steers was selected from the herd and
weighed. These data and prestudy weights are presented below for all cattle sampled.
Assume μx, the pre-study average, was 880 lbs. Estimate μy, the average weight for the herd,
and place a bound on the error of estimation. All the weights below are in pounds.
Steer   Pre-study weight   Present weight
        (xi)               (yi)             xi²         yi²          xi·yi
1       815                897              664,225     804,609      731,055
2       919                992              844,561     984,064      911,648
3       690                752              476,100     565,504      518,880
4       984                1,093            968,256     1,194,649    1,075,512
5       200                768              40,000      589,824      153,600
6       260                828              67,600      685,584      215,280
7       1,323              1,428            1,750,329   2,039,184    1,889,244
8       1,067              1,152            1,138,489   1,327,104    1,229,184
9       789                875              622,521     765,625      690,375
10      573                642              328,329     412,164      367,866
11      834                909              695,556     826,281      758,106
12      1,049              1,122            1,100,401   1,258,884    1,176,978
Σ       9,503              11,458           8,696,367   11,453,476   9,717,728
Answer: Mean = 1,061.0376; Error = 139.9468
12.
An advertising firm is concerned about the effect of a new regional promotional campaign on
the total dollar sales for a particular product. A simple random sample of n = 20 stores is
drawn from the N = 452 regional stores in which the product is sold. Quarterly sales data are
obtained for the current three-month period and the three-month period prior to the new
campaign. Use these data to estimate Ty, the total sales for the current period, and place a
bound on the error of estimation. Assume Tx = 216,256.
Store   Pre-Campaign   Present
        Sales (xi)     Sales (yi)   xi²         yi²         xi·yi
1       208            239          43,264      57,121      49,712
2       400            428          160,000     183,184     171,200
3       440            472          193,600     222,784     207,680
4       259            276          67,081      76,176      71,484
5       351            363          123,201     131,769     127,413
6       880            942          774,400     887,364     828,960
7       273            294          74,529      86,436      80,262
8       487            514          237,169     264,196     250,318
9       183            195          33,489      38,025      35,685
10      863            897          744,769     804,609     774,111
11      599            626          358,801     391,876     374,974
12      510            538          260,100     289,444     274,380
13      828            888          685,584     788,544     735,264
14      473            510          223,729     260,100     241,230
15      924            998          853,776     996,004     922,152
16      110            171          12,100      29,241      18,810
17      829            889          687,241     790,321     736,981
18      257            265          66,049      70,225      68,105
19      388            419          150,544     175,561     162,572
20      244            257          59,536      66,049      62,708
Σ       9,506          10,181       5,808,962   6,609,029   6,194,001
13.
Use the data of Exercise 12 to determine the sample size required to estimate Ty with a bound
on the error of estimation equal to $3,800.
Answer: n = 14.
14.
A 10-percent simple random sample of housing units in a village has been selected producing
the 12 housing units listed below. At each sample unit, information was obtained on the
number of persons in the household and the total annual earnings; the results are given below.
It is also known from independent sources that the total population of all households in the
village is 600 persons.
Sample unit   Total persons   Total earnings
1             6               $ 7,000
2             6               8,000
3             5               3,000
4             8               10,000
5             4               2,000
6             2               1,000
7             4               2,000
8             5               3,000
9             1               1,000
10            7               8,000
11            4               1,000
12            5               6,000
Total         57              $52,000
a.
Estimate the total earnings in all households in the village using a direct inflation
factor.
b.
Estimate the total earnings in all households in the village using a ratio estimate.
c.
Use the sample results to estimate the coefficient of variation for each of the above
estimates.
15.
The following table shows the total hectares in three farms, along with the payments for farm
labor, drawn from 30 farms. The true value of the total hectares of all farms, X, is assumed to
be 800.
Farm (i)   Hectares (xi)   Payments (yi)
1          5               382
2          8               467
3          10              701
a.
Estimate the total payments for all farms, Y.
b.
Estimate the variance of the estimate in (a).
c.
Compute the coefficient of variation of the estimate in (a).
d.
Find a 95% confidence interval for Y.
CHAPTER 10
CLUSTER SAMPLING
________________________________________________________________________________________
10.1
DESCRIPTION OF CLUSTER SAMPLING
The discussion so far has been about sampling methods in which the units of analysis (people, farms,
business firms, etc.) were considered as arranged in a list (or its equivalent) and a sample of
individual units could be selected directly from the list. Now we will consider a sampling procedure
in which the units of analysis in the population are grouped into clusters and a sample of clusters
(rather than a sample of individual units of analysis) is selected. The sample clusters then determine
the units to be included. The determination may be made in either of two ways:
(1) The sample could include all units in the selected clusters. This is usually referred to as
single-stage cluster sampling.
(2) A subsample of units in the selected clusters could be selected for enumeration. This is
called multi-stage cluster sampling, or simply multi-stage sampling.
There are two main reasons for using cluster sampling. Often there is no adequate frame (such as a
list) from which to select a sample of the elements in the population, and the cost of constructing
such a frame may be too great. In other cases, such a frame may exist but the savings in field costs
obtained by cluster sampling (on some kind of geographical basis) may make this method more
efficient than a simple random sample from a list. In most practical situations, a sample of a given
number of units selected at random will have smaller variance than a sample of the same size
selected in clusters; nevertheless, when cost is balanced against precision, the cluster sample may be
more efficient.
Even though the units in which we are interested are not selected directly, the probability of selecting
a cluster and each unit in it (i.e., the probability of selecting a unit from the population) is fixed in
advance; consequently, cluster sampling satisfies the criterion for probability sampling.
Let us consider some examples to see how cluster sampling works.
10.1.1
Single-Stage Cluster Sampling
To draw a sample of persons, it would generally not be feasible to obtain a list of all persons, and
then to select a sample from the list. It might be possible to find a list of families. We could then
select a sample of families and obtain information by interview concerning all persons in the selected
families. This is an example of single-stage cluster sampling; the family constitutes the cluster.
Note that for a given number of individuals in the sample, it would undoubtedly be less costly in
terms of both travel and time to take all persons within selected families than to select the same
number of persons at random from all individuals in the population.
Often there is no list of families available, and some other procedure must be used. A possible
method is as follows. In large cities, a map showing the boundaries of city blocks can usually be
obtained; and we can select a sample of blocks. In the rest of the country, we can use maps divided
into small areas called segments, which have identifiable boundaries, and select a sample of
segments. Within the sample blocks and segments, we could include all persons in the sample;
alternatively, we could select a sample of persons living in the selected blocks. The choice would
depend upon the number of stages of sampling we believe would be most efficient. By using maps,
we eliminate the need for a list of all persons. We replace it with a list of blocks and segments and a
list of families within a sample of blocks and segments. (In practice there frequently is an earlier
stage of sampling in which a sample of cities and/or other administrative areas is selected.) The
preceding discussion illustrates an important application of cluster sampling; namely, area sampling.
However, other applications of cluster sampling are frequently made.
10.1.2
Multi-Stage Cluster Sampling
Suppose we wish to make a survey of school children in order to obtain information on their health,
or information on their knowledge of a particular subject. One way to do this is to obtain a complete
list of schools, then select a sample of schools, and finally choose a sample of children within the
selected schools. Similarly, a sample of factory workers could be selected by first choosing a sample
of factories and then interviewing a sample of workers within these factories. In both cases we
would need to construct a list of individuals only for the schools or factories selected in the sample.
These examples illustrate multi-stage (specifically, two-stage) cluster sampling. The probability that
any unit in the population is selected in the sample can be expressed as the product of the
probabilities at each stage. Thus, in the first example the probability of selecting the jth child from
the ith school is the probability of first selecting the ith school times the conditional probability of
selecting the jth child, given that the ith school has been selected. That is,
P(jth child, ith school) = P(ith school) x P(jth child | ith school).
10.2
AREA SAMPLING
Since area sampling is a frequently used application of cluster sampling, we shall describe in more
detail the methods which are usually applied. Area sampling is useful when one or both of the
following conditions exist:
(1)
When complete lists of housing units (or other desired units of observation) are not
available but maps having a reasonable amount of detail are available. Such maps can be
considered as a list covering all of the housing units in the area.
(2)
When there are large travel costs in sending an interviewer from one randomly selected
sample housing unit to another randomly selected housing unit. For a given amount of
money, we may be able to increase the number of sample housing units greatly by
grouping units together and selecting a random sample of groups.
Three simple procedures exist for drawing an area sample. We shall use city blocks as an illustration
(segments of land with identifiable boundaries around them could be used in rural areas in exactly
the same way as blocks are used in cities). We shall assume that a 1-percent sample of housing units
is to be drawn.
Procedure A for a sample of areas to be enumerated completely:
(1)
Obtain a reasonably accurate map of the city, showing as much detail as possible for
blocks. If the map is not new, one should take steps through local inquiry to bring it up-to-date (for example, draw in new streets that have been opened since the map was
printed).
(2)
Number the blocks serially, entering the numbers directly on the map; a serpentine
numbering system is advisable in order to make certain that no blocks are omitted.
(3)
Select a simple random or systematic sample of blocks, using a 1-percent sample. If a
systematic sample is used, select a random number from 1 to 100 to determine the first
sample block, and include every one-hundredth block thereafter.
(4)
Interview all households in the sample blocks.
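The systematic selection in step (3) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name systematic_sample, the city of 2,350 numbered blocks, and the fixed random seed are hypothetical and are not part of the procedure itself.

```python
import random


def systematic_sample(num_blocks, interval, rng=None):
    """Return the serial numbers (1-based) of the blocks chosen systematically.

    A random start between 1 and `interval` is drawn, and every
    `interval`-th block after it is included.
    """
    rng = rng or random.Random()
    start = rng.randint(1, interval)  # inclusive on both ends
    return list(range(start, num_blocks + 1, interval))


# Hypothetical city with 2,350 serially numbered blocks; 1-in-100 sample.
sample_blocks = systematic_sample(2350, 100, random.Random(12345))
print(sample_blocks)
```

Whatever random start is drawn, each block has a 1-in-100 chance of falling in the sample.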
Procedure B for a sample of areas with subsampling of smaller areas:
The 1-percent sample can also be obtained by drawing, for example, a sample of 1 in 25 blocks, then
taking a subsample of one-fourth of the area in each sample block.
(1)
Proceed as in (1), (2), and (3) in procedure A above, except that instead of taking 1 in 100
blocks, take 1 block in 25.
(2)
Divide each of the sample blocks into 4 segments. If maps are available that show the
internal structure of each block (alleys, buildings, etc.), these can be used. If not, make a
quick and crude sketch of the sample blocks, showing each building; use this sketch as
the basis of the segmentation. The 4 segments within any block should have roughly the
same number of housing units in each.
(3)
Number the segments in each block from 1 to 4.
(4)
Select the sample segments by taking a random number from 1 to 4 for each block.
(5)
Interview all households in the selected segments.
Notice that although a 1-percent sample is obtained in both procedures, procedure B includes more
sample blocks and fewer housing units per block. Usually, it will cost more to obtain the same
sample size by procedure B, since there is a cost of subsampling not involved in procedure A; also,
travel will be increased in visiting a greater number of blocks. This subsampling procedure is almost
equivalent to dividing every block in the city into 4 parts, or segments, and taking 1 in 100 of these
segments. Hence, the use of subsampling as described above in procedure B can be regarded as
essentially equivalent to using a sample of small clusters of housing units (in which every housing
unit would be enumerated) but with two-stage sampling as a device for reducing the work of drawing
a sample of small clusters.
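That both procedures give every housing unit the same overall chance of selection can be checked by multiplying the selection probabilities at each stage. The short Python sketch below does this with exact fractions. The function name overall_probability is an assumption made for illustration.

```python
from fractions import Fraction


def overall_probability(*stage_probabilities):
    """Multiply the selection probabilities of successive sampling stages."""
    p = Fraction(1)
    for q in stage_probabilities:
        p *= q
    return p


# Procedure A: one stage, 1 block in 100, all households interviewed.
procedure_a = overall_probability(Fraction(1, 100))
# Procedure B: 1 block in 25, then 1 segment in 4 within each sample block.
procedure_b = overall_probability(Fraction(1, 25), Fraction(1, 4))
print(procedure_a, procedure_b)
```

Exact fractions are used so that the comparison between the two designs is not affected by rounding.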
Procedure C for a sample of areas with listing and subsampling:
To carry out procedure B, it is necessary to have or to construct detailed maps. A third procedure
accomplishes approximately the same results and is frequently applicable when detailed maps are not
available and are not easy to prepare.
(1)
Proceed as in step (1) of procedure B, again selecting a sample of 1 in 25 blocks.
(2)
Visit each sample block and make a list of all the housing units in it. Number the housing
units serially. The numbering can be done (a) separately by blocks (that is, starting with 1
for each block), (b) in a single sequence throughout all the sample blocks, or (c) by some
combination, such as a separate sequence for various groups of blocks.
(3)
Select one-fourth of the housing units within the sample blocks either by using a random
number table, or by systematic sampling using the serial numbers assigned to the housing
units.
(4)
Interview the households whose serial numbers are selected for the sample.
Note: If advance information is available on the approximate numbers of housing units in all blocks,
some combination of the above procedures with stratification of blocks by size can be used.
10.3
CHOICE OF SAMPLING UNIT AND SAMPLE DESIGN
In designing a sample, the sampling statistician must decide how many sampling stages are to be
used. In addition, at each stage he must determine the sampling unit. In making his decision, the
statistician often has many alternatives from which to choose. Suppose, for example, that he desires
to estimate the average number of cattle per holding. Ultimately, the information must be obtained
from a sample of individual holdings (units of analysis or elementary units). In order to obtain such
a sample, however, any of the following plans could be used:
(1)
A simple random, systematic, or stratified sample of individual holdings could be taken if
complete and accurate lists of holdings were available.
(2)
Maps could be used to subdivide the country into small area segments (for example,
segments containing an average of 5 or 10 holdings). A sample of these area segments
could then be selected, and all holdings within each selected segment included in the
sample. For holdings which extend across segment boundaries, rules would be needed to
associate holdings with segments.
(3)
A sample of small administrative subdivisions, such as districts, could be selected. All
holdings in the selected districts could be included in the sample, or a subsample of
holdings could be selected.
(4)
A sample of provinces (larger administrative divisions) could be selected, and a sample of
areas and holdings within the selected provinces could be taken in one of the ways
described in procedures A, B, and C above.
Where subsampling is used, the cluster initially selected is called the first-stage unit or the primary
sampling unit (PSU) and the unit of subsampling is called the second-stage unit (SSU). For
example, in (3) above, if a subsample of holdings is selected, the "district" is the PSU and the
holding is the second-stage unit; in (4), the "province" is the PSU, the small area is the second-stage
unit, and still smaller areas or holdings may be third-stage units (TSU).
How can one make an intelligent choice among the various alternatives? We may reason as follows:
where cost is not important, single-stage sampling using the elementary unit (the holding in the
above case) as the sampling unit provides the most accurate results for the given number of
elementary units in the sample. (There are some exceptions, but these are rather unusual cases.) On
the other hand, when cost and administrative convenience are important, a cluster sample involving
one or more stages may be desirable. The cost of enumeration per elementary unit is usually much
less if the units are in clusters than if they are randomly distributed throughout the country; by
clustering, travel time and cost for interviewing are reduced. As a result, for a given amount of
money it may be possible, by using cluster sampling, to increase the number of elementary units in
the sample above the number that the same budget would allow if these were selected at random. If
the increase in the number of units more than compensates for the fact that a cluster sample tends to
increase the standard error, a net gain will be obtained in the reliability of estimates made from the
sample.
In order to choose among alternative sampling units, we must therefore balance the expected costs
against the standard errors for the various possible designs and use the method which will provide
the smallest standard error for a fixed cost. In some administrative situations, the correct decision
may be obvious. If the survey involves little or no travel cost--for example, if mail questionnaires
are used, or if the survey uses personnel who travel around as a normal part of their other activities,
such as policemen or postmen (mailmen)--and if listings of elementary units are available, the
elementary unit should always be taken as the sampling unit. If travel costs or the costs of
constructing lists of elementary units are rather large, an alternative design using a clustered sample
will usually be better. A full discussion of this matter is beyond the scope of these chapters, but
some of the important points will be discussed here.
10.4
ANALYSIS OF COSTS
Usually there is a fixed budget available for a survey, and one of the major functions of the sampling
statistician is to provide a method of obtaining the smallest sampling error for this budget. Let us
first examine how costs enter into a survey involving the use of cluster sampling.
In studying stratified sampling, we discussed the possibility that enumeration and processing costs
can vary from stratum to stratum, and we constructed a cost function which expressed the variable
part of the total cost as a sum of unit costs multiplied by sample sizes (for example, C = C1 n1 + C2 n2
+ ...). A similar approach is needed for cluster sampling, although the unit costs are of a different
type. For simplicity, let us consider a two-stage sample.
10.4.1
Components of Cost
In order to analyze the costs of a two-stage cluster sample, it is necessary to identify the various
phases of the survey and to distinguish among three elements of cost:
(1)
Overhead costs; that is, those costs that are fixed regardless of the manner in which the
sample is selected.
(2)
Costs that depend primarily on the number of first-stage clusters in the sample, and the
way in which such costs vary as the number of these primary sampling units in the sample
varies.
(3)
The costs that depend primarily on the number of second-stage units in the sample, and
the way in which such costs vary with this number.
10.4.1.1
Overhead Costs
Overhead costs include such things as the administrative and technical work required for the survey,
rent for space and for some types of equipment, cost of printing the final results, etc. These costs
will generally be approximately the same, even with great variations in the size and design of the
survey. Since these costs are not affected by the size of the survey, they do not enter into the
decision on sample design. The only reason for separating these costs is to subtract them from the
total available budget in order to see what funds can be spent on the variable costs.
10.4.1.2
Costs of First-Stage Units
Certain costs will usually vary in proportion to the number of first-stage sampling units. These will
include (a) the cost of selecting, traveling to, and locating each first-stage unit, (b) the cost of
preparing a list of second-stage units (within the primary unit), and (c) the cost of designating the
subsample of second-stage units. There may also be other costs (costs of preparing maps for the
first-stage sample units, hiring special enumerators to handle each one, etc.) depending on the nature
of the administrative organization, and the materials available before the start of the survey.
10.4.1.3
Costs of Second-Stage Units
The costs which depend on the number of second-stage units will include the costs of interviewing,
reviewing the survey results, coding, recording, etc.
10.4.2
A Simple Cost Function
Let us assume a simple situation in which the cost per first-stage unit does not change despite
changes in the number of such units in the sample. Similarly, the cost per second-stage unit does not
change. Then the total variable cost (which excludes overhead costs) can be represented by
C = C1 n + C2 m     (10.1)
where
C1 is the cost per first-stage unit,
C2 is the cost per second-stage unit,
n is the total number of first-stage sampling units,
m is the total number of second-stage sampling units, and
$\bar{m} = m/n$ is the average number of second-stage units in a primary unit.
Using equation (10.1), one can set down combinations of n and m which would add up to the same
cost. For example, suppose the total variable cost available for a survey was $2,500, and the
estimates of C1 and C2 were $10 and $2, respectively. The table below shows various combinations
of sample sizes all of which would cost exactly $2,500; the last column shows the average size of
cluster for each allocation:
Number of units in sample
First stage (n)    Second stage (m)    Average size of cluster (m/n)
      10                1200                 120
      20                1150                  57.5
      50                1000                  20
      75                 875                  11.7
     100                 750                   7.5
     125                 625                   5
     150                 500                   3.3
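The table can be reproduced directly from the cost function C = C1 n + C2 m by solving for m at each value of n. The following Python sketch does so, using the budget and unit costs of the example. The function name second_stage_size is hypothetical.

```python
def second_stage_size(budget, c1, c2, n):
    """Solve C = C1*n + C2*m for m, given n first-stage units."""
    return (budget - c1 * n) / c2


BUDGET, C1, C2 = 2500, 10, 2  # variable budget and unit costs from the example

rows = []
for n in (10, 20, 50, 75, 100, 125, 150):
    m = second_stage_size(BUDGET, C1, C2, n)
    rows.append((n, m, m / n))  # (clusters, second-stage units, average cluster size)
    print(f"{n:>5} {m:>8.0f} {m / n:>8.1f}")
```

The same loop can be run over any other values of n, or with a different cost function substituted for second_stage_size, when alternative allocations are being compared.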
If the sampling error can be found for each of the above combinations, one can choose that
combination which would give the lowest sampling error. In fact, with this simple type of cost
function, it is usually possible to determine the optimum allocation mathematically. However, this is
not necessary; if a formula can be found which expresses the variance in terms of n and m, we can
easily see which combination is best. Furthermore, this can also be done in situations involving
more complex cost functions, when it is more difficult to develop a mathematical solution to the
problem of optimum allocation. The next chapter will be devoted to analyzing the variances for the
simpler and more common situations.
10.4.3
More Complex Cost Functions
One additional comment on costs should be made. The formulation of the cost function above as C
= C1 n + C2 m covers the simplest type of situation only. In practice, the cost function may be much
more complex. For example, there may be stratification for either the first-stage or the second-stage
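units with different unit costs in each stratum. The cost function would then take a form such as
$$C = \sum_{h} \left( C_{1h} n_h + C_{2h} m_h \right)$$
where h indexes the strata,
and the problem of the allocation of the sample would be a combination of optimum allocation for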
cluster sampling with optimum allocation for stratified sampling. Frequently, the unit costs would
depend on the number of units in the sample.
For example, suppose that C1 included a part that resulted from the time spent traveling from one
first-stage unit to another. With only a few primary units in the sample, the average distance from
one to the next might be quite large, resulting in a high value of C1. However, as the number of units
in the sample increases, the average distance gets smaller and C1 will be smaller. A different type of
cost function would be used in such a situation. In general, in planning a large-scale and important
survey, a detailed analysis should be made of how costs vary, in order to construct a cost function
which is realistic for that particular survey.
10.5 ESTIMATION OF A POPULATION MEAN AND TOTAL
Cluster sampling is simple random sampling with each sampling unit containing a number of
elements. Hence, the estimators of the population mean, $\mu$, and total, $\tau$, are similar to those for
simple random sampling. In particular, the sample mean, $\bar{y}$, is a good estimator of the population
mean, $\mu$. An estimator of $\mu$ and two estimators of $\tau$ are discussed in this section.
The following notation is used in this chapter:
N = the number of clusters in the population
n = the number of clusters selected in a simple random sample
$m_i$ = the number of elements in cluster i, i = 1, ..., N
$\bar{m} = \frac{1}{n}\sum_{i=1}^{n} m_i$ = the average cluster size for the sample
$M = \sum_{i=1}^{N} m_i$ = the number of elements in the population
$\bar{M} = M/N$ = the average cluster size for the population
$y_i$ = the total of all observations in the ith cluster
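The estimator of the population mean, $\mu$, is the sample mean, $\bar{y}$, which is given by:
$$\bar{y} = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} m_i}$$
Thus, $\bar{y}$ takes the form of a ratio estimator, as developed in Chapter 11, with $m_i$ taking the place of
$x_i$. Then, the estimated variance of $\bar{y}$ has the form of the variance of a ratio estimator:
Estimator of the population mean $\mu$:
$$\bar{y} = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} m_i} \qquad (10.1)$$
Estimated variance of $\bar{y}$:
$$\hat{V}(\bar{y}) = \frac{N-n}{N n \bar{M}^2} \cdot \frac{\sum_{i=1}^{n} (y_i - \bar{y} m_i)^2}{n-1} \qquad (10.2)$$
The bound on the error of estimation is therefore
$$2\sqrt{\hat{V}(\bar{y})} \qquad (10.3)$$
$\bar{M}$ can be estimated by $\bar{m}$ if M is unknown.
The estimated variance in equation (10.2) is biased and is a good estimator of $V(\bar{y})$ only if n is
large, say n $\geq$ 20. The bias disappears if the cluster sizes $m_1, m_2, \ldots, m_N$ are all equal.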
Estimator of the population total $\tau$:
$$M\bar{y} = M \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} m_i} \qquad (10.4)$$
Estimated variance of $M\bar{y}$:
$$\hat{V}(M\bar{y}) = M^2 \hat{V}(\bar{y}) = N^2 \frac{N-n}{N n} \cdot \frac{\sum_{i=1}^{n} (y_i - \bar{y} m_i)^2}{n-1} \qquad (10.5)$$
Note that the estimator $M\bar{y}$ is useful only if the number of elements in the population, M, is known.
Often the number of elements in the population is not known in problems for which cluster sampling
is appropriate. This makes it impossible to use the estimator $M\bar{y}$, but we can form another estimator
of the population total which does not depend on M. The quantity $\bar{y}_t$, given by
$$\bar{y}_t = \frac{1}{n} \sum_{i=1}^{n} y_i \qquad (10.7)$$
is the average of the cluster totals for the n sampled clusters. Hence, $\bar{y}_t$ is an unbiased estimator of
the average of the N cluster totals in the population. By the same reasoning as used previously, the
estimator $N\bar{y}_t$ is an unbiased estimator of the sum of the cluster totals or, equivalently, of the
population total, $\tau$.
For example, it is highly unlikely that the number of adult males in a city would be known, and
hence the estimator $N\bar{y}_t$ rather than $M\bar{y}$ would have to be used to estimate $\tau$.
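An estimator of the population total, $\tau$, which does not depend on M:
$$N\bar{y}_t = \frac{N}{n} \sum_{i=1}^{n} y_i \qquad (10.8)$$
The estimated variance of $N\bar{y}_t$:
$$\hat{V}(N\bar{y}_t) = N^2 \hat{V}(\bar{y}_t) = N^2 \frac{N-n}{N n} \cdot \frac{\sum_{i=1}^{n} (y_i - \bar{y}_t)^2}{n-1} \qquad (10.9)$$
If there is a large amount of variation among the cluster sizes and if cluster sizes are highly
correlated with cluster totals, the variance of $N\bar{y}_t$ is generally larger than the variance of $M\bar{y}$. The
estimator $N\bar{y}_t$ does not use the information provided by the cluster sizes $m_1, m_2, \ldots, m_n$ and,
hence, may be less precise.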
The estimators of $\mu$ and $\tau$ possess special properties when all cluster sizes are equal, that is, when
$m_1 = m_2 = \cdots = m_N$. First, the estimator $\bar{y}$, given by equation (10.1), is an unbiased estimator of
the population mean $\mu$. Second, $\hat{V}(\bar{y})$, given by equation (10.2), is an unbiased estimator of the
variance of $\bar{y}$. Finally, the two estimators, $M\bar{y}$ and $N\bar{y}_t$, of the population total $\tau$ are equivalent.
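As an illustration, the estimators of this section can be computed with a short Python sketch. It assumes the usual ratio-type formulas for one-stage cluster sampling: the mean is estimated by the sum of the sampled cluster totals divided by the sum of the sampled cluster sizes, the population average cluster size is replaced by the sample average when M is not supplied, and the bound is taken as two standard errors. The function name cluster_estimates is hypothetical. The data are the band saw figures of Exercise 1 at the end of this chapter.

```python
import math


def cluster_estimates(y, m, N, M=None):
    """Estimate the mean per element and the total from a one-stage cluster sample.

    y: cluster totals; m: cluster sizes; N: clusters in the population;
    M: elements in the population (optional; the sample average size is used if absent).
    """
    n = len(y)
    y_bar = sum(y) / sum(m)                       # ratio-type estimator of the mean
    M_bar = M / N if M is not None else sum(m) / n
    s2_c = sum((yi - y_bar * mi) ** 2 for yi, mi in zip(y, m)) / (n - 1)
    var_mean = (N - n) / (N * n * M_bar ** 2) * s2_c

    y_bar_t = sum(y) / n                          # average cluster total
    s2_t = sum((yi - y_bar_t) ** 2 for yi in y) / (n - 1)
    var_total = N ** 2 * (N - n) / (N * n) * s2_t  # variance of N * y_bar_t

    return {
        "mean": y_bar,
        "bound_mean": 2 * math.sqrt(var_mean),
        "total": N * y_bar_t,
        "bound_total": 2 * math.sqrt(var_total),
    }


saws = [3, 7, 11, 9, 2, 12, 14, 3, 5, 9, 8, 6, 3, 2, 1, 4, 12, 6, 5, 8]
cost = [50, 110, 230, 140, 60, 280, 240, 45, 60, 230,
        140, 130, 70, 50, 10, 60, 280, 150, 110, 120]
est = cluster_estimates(cost, saws, N=96)
print(est)
```

With these data the sketch gives a mean of about 19.73 with a bound of about 1.78, and a total of 12,312 with a bound of about 3,175, in agreement with the answers printed for Exercises 1 and 2.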
10.6 Selecting the Sample Size for Estimating Population Means and Totals
The quantity of information in a cluster sample is affected by two factors, the number of clusters and
the relative cluster size. We have not encountered the latter factor in any of the sampling procedures
discussed previously. In the problem of estimating the number of homes with inadequate fire
insurance in a state, the clusters could be counties, voting districts, school districts, communities, or
any other convenient grouping of homes. We will assume that the relative cluster size has been
selected in advance and will consider the problem of choosing the number of clusters, n.
From equation (10.2), the estimated variance of $\bar{y}$ is
$$\hat{V}(\bar{y}) = \frac{N-n}{N n \bar{M}^2} s_c^2$$
where
$$s_c^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y} m_i)^2}{n-1} \qquad (10.11)$$
The actual variance of $\bar{y}$ is approximately
$$V(\bar{y}) = \frac{N-n}{N n \bar{M}^2} \sigma_c^2 \qquad (10.12)$$
where $\sigma_c^2$ is the population quantity estimated by $s_c^2$.
Because we do not know $\sigma_c^2$ or the average cluster size, $\bar{M}$, choice of the sample size, that is, the
number of clusters necessary to purchase a specified quantity of information concerning a population
parameter, is difficult. We overcome this difficulty by using an estimate of $\sigma_c^2$ and $\bar{M}$ from a prior
survey or by selecting a preliminary sample containing n' units. Thus, as in all problems of selecting
a sample size, we equate two standard deviations of our estimator to a bound on the error of
estimation, E. This bound is chosen by the experimenter and represents the maximum error that he
is willing to tolerate. That is,
$$2\sqrt{V(\bar{y})} = E$$
Using equation (10.12), we can solve for n.
We obtain similar results when using $M\bar{y}$ to estimate the population total $\tau$ because
$V(M\bar{y}) = M^2 V(\bar{y})$.
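The approximate sample size required to estimate $\mu$ with a bound, E, on the error of
estimation:
$$n = \frac{N \sigma_c^2}{N D + \sigma_c^2} \qquad (10.13)$$
where $\sigma_c^2$ is estimated by $s_c^2$ and $D = E^2 \bar{M}^2 / 4$.
The approximate sample size required to estimate $\tau$, using $M\bar{y}$, with a bound, E, on the error of
estimation:
$$n = \frac{N \sigma_c^2}{N D + \sigma_c^2} \qquad (10.14)$$
where $\sigma_c^2$ is estimated by $s_c^2$ and $D = E^2 / (4N^2)$.
The approximate sample size required to estimate $\tau$, using $N\bar{y}_t$, with a bound, E, on the error of
estimation:
$$n = \frac{N \sigma_t^2}{N D + \sigma_t^2} \qquad (10.17)$$
where $\sigma_t^2$ is estimated by $s_t^2 = \sum_{i=1}^{n} (y_i - \bar{y}_t)^2 / (n-1)$ and $D = E^2 / (4N^2)$.
The estimator of the population proportion P is given by:
$$p = \frac{\sum_{i=1}^{n} a_i}{\sum_{i=1}^{n} m_i} \qquad (10.18)$$
The estimated variance of p is:
$$\hat{V}(p) = \frac{N-n}{N n \bar{M}^2} \cdot \frac{\sum_{i=1}^{n} (a_i - p m_i)^2}{n-1} \qquad (10.19)$$
where $a_i$ denotes the total number of elements in cluster i that possess the characteristic of interest.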
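The sample-size rule of this section can be evaluated with a few lines of Python. The sketch below assumes the form n = N s² / (N D + s²), with D = E² M̄² / 4 when the mean is being estimated and D = E² / (4N²) when the total is being estimated, and rounds n up to a whole cluster. The function names are hypothetical. The planning values are those of the band saw example in Exercises 1, 3, and 4, with s_c² computed from the Exercise 1 sample.

```python
import math


def clusters_needed(N, sigma2, D):
    """Number of clusters n = N*sigma2 / (N*D + sigma2), rounded up."""
    return math.ceil(N * sigma2 / (N * D + sigma2))


def d_for_mean(E, M_bar):
    """D when estimating the population mean with bound E."""
    return E ** 2 * M_bar ** 2 / 4


def d_for_total(E, N):
    """D when estimating the population total with bound E."""
    return E ** 2 / (4 * N ** 2)


N = 96            # industries (clusters) in the population
s2_c = 845.56     # value of s_c^2 computed from the Exercise 1 sample
M_bar = 710 / N   # average cluster size, using M = 710 saws from Exercise 3
E = 2.0           # desired bound on the error of estimation, in dollars

n_mean = clusters_needed(N, s2_c, d_for_mean(E, M_bar))
print(n_mean)
```

With these inputs the sketch returns n = 14, which agrees with the answer printed for Exercise 4.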
One-Stage Cluster Sampling Problems
1.
A manufacturer of band saws wants to estimate the average repair cost per month for the saws
he has sold to certain industries. He cannot obtain a repair cost for each saw, but he can obtain
the total amount spent for saw repairs and the number of saws owned by each industry. Thus,
he decides to use cluster sampling with each industry as a cluster. The manufacturer selects a
simple random sample of n = 20 from the N = 96 industries which he services. The data on
total cost of repairs per industry and number of saws per industry are as follows:
Industry (cluster)    Number of saws    Total repair cost for past month (dollars)
        1                    3                      50
        2                    7                     110
        3                   11                     230
        4                    9                     140
        5                    2                      60
        6                   12                     280
        7                   14                     240
        8                    3                      45
        9                    5                      60
       10                    9                     230
       11                    8                     140
       12                    6                     130
       13                    3                      70
       14                    2                      50
       15                    1                      10
       16                    4                      60
       17                   12                     280
       18                    6                     150
       19                    5                     110
       20                    8                     120
     Total                 130                   2,565
Estimate the average repair cost per saw for the past month, and place a bound on the error of
estimation.
Answer: Mean = 19.73077; Error = 2 s(mean) = 1.78
2.
For the data in Exercise 1, estimate the total amount spent by the 96 industries on band saw
repairs. Place a bound on the error of estimation.
Answer: Total = 12312; Error = 2 s(Total) = 3175.067
3.
After checking his sales records, the manufacturer of Exercise 1 finds that he sold a total of 710
band saws to these industries. Using this additional information, estimate the total amount spent
on saw repairs by these industries and place a bound on the error of estimation.
4.
The same manufacturer wants to estimate the average repair cost per saw for next month. How
many clusters should he select for his sample if he wants the bound on the error of estimation
to be less than $2.00?
Answer: n = 14
5.
A political scientist developed a test designed to measure the degree of awareness of current
events. He wants to estimate the average score which would be achieved on this test by all
students in a certain high school. The administration at the school would not allow the
experimenter to randomly select students out of classes in session, but it would allow him to
interrupt a small number of classes for the purpose of giving the test to every member of the
class. Thus, the experimenter selects 25 classes at random from the 108 classes in session at a
particular hour. The test is given to each member of the sampled classes with the following
results:
Class    Number of students    Total score
  1             31                1590
  2             29                1510
  3             25                1490
  4             35                1610
  5             15                 800
  6             31                1720
  7             22                1310
  8             27                1427
  9             25                1290
 10             19                 860
 11             30                1620
 12             18                 710
 13             21                1140
 14             40                1980
 15             38                1990
 16             28                1420
 17             17                 900
 18             22                1080
 19             41                2010
 20             32                1740
 21             35                1750
 22             19                 890
 23             29                1470
 24             18                 910
 25             31                1740
Estimate the average score that would be achieved on this test by all students in the school.
Place a bound on the error of estimation.
6.
The same political scientist of Exercise 5 wants to estimate the average test score for a
similar high school. If he wants the bound on the error of estimation to be less than 2 points,
how many classes should he sample? Assume the school has 100 classes in session during
each hour.
Answer: n = 13
7.
An industry is considering revision of its retirement policy and wants to estimate the
proportion of employees who favor the new policy. The industry consists of 87 separate
plants located throughout the United States. Since results must be obtained quickly and with
little cost, the industry decides to use cluster sampling with each plant as a cluster. A simple
random sample of 15 plants is selected, and the opinions of the employees in these plants are
obtained by questionnaire. The results are as follows:
Plant    Number of employees    Number favoring new policy
  1              51                       42
  2              62                       53
  3              49                       40
  4              73                       45
  5             101                       63
  6              48                       31
  7              65                       38
  8              49                       30
  9              73                       54
 10              61                       45
 11              58                       51
 12              52                       29
 13              65                       46
 14              49                       37
 15              55                       42
Estimate the proportion of employees in the industry who favor the new retirement policy
and place a bound on the error of estimation.
8.
The industry of Exercise 7 modified its retirement policy after obtaining the results of the
survey. It now wants to estimate the proportion of employees in favor of the modified
policy. How large a sample should be taken to have a bound of 0.08 on the error of
estimation? Use the data from Exercise 7 to approximate the results of the new survey.
Answer: n = 7
9.
An economic survey is designed to estimate the average amount spent on utilities for
households in a city. Since no list of households is available, cluster sampling is used with
divisions (wards) forming the clusters. A simple random sample of 20 wards is selected
from the 60 wards of the city. Interviewers then obtain the cost of utilities from each
household within the sampled wards; the total costs are tabulated below:
Sampled ward    Number of households    Total amount spent on utilities
      1                 55                        2210
      2                 60                        2390
      3                 63                        2430
      4                 58                        2380
      5                 71                        2760
      6                 78                        3110
      7                 69                        2780
      8                 58                        2370
      9                 52                        1990
     10                 71                        2810
     11                 73                        2930
     12                 64                        2470
     13                 69                        2830
     14                 58                        2370
     15                 63                        2390
     16                 75                        2870
     17                 78                        3210
     18                 51                        2430
     19                 67                        2730
     20                 70                        2880
Estimate the average amount a household in the city spends on utilities, and place a bound
on the error of estimation.
10.
In the above survey the number of households in the city is not known. Estimate the total
amount spent on utilities for all households in the city, and place a bound on the error of
estimation.
Answer: Total = 157,020; Error = 6927.875
11.
The economic survey of Exercise 9 is to be performed in a neighboring city of similar
structure. The objective is to estimate the total amount spent on utilities by households in
the city with a bound of $5,000 on the error of estimation. Use the data in Exercise 9 to find
the approximate sample size needed to achieve this bound.
Answer: n = 30
12.
An inspector wants to estimate the average weight of fill for cereal boxes packaged in a
certain factory. The cereal is available to him in cartons containing 12 boxes each. The
inspector randomly selects 5 cartons and measures the weight of fill for every box in the
sampled cartons, with the following results (in ounces):
Carton    Ounces of fill
  1       16.1  15.9  16.1  16.2  15.9  15.8  16.1  16.2  16.0  15.9  15.8  16.0
  2       15.9  16.2  15.8  16.0  16.3  16.1  15.8  15.9  16.0  16.1  16.1  15.9
  3       16.2  16.0  15.7  16.3  15.8  16.0  15.9  16.0  16.1  16.0  15.9  16.1
  4       15.9  16.1  16.2  16.1  16.1  16.3  15.9  16.1  15.9  15.9  16.0  16.0
  5       16.0  15.8  16.3  15.7  16.1  15.9  16.0  16.1  15.8  16.0  16.1  15.9
Estimate the average weight of fill for boxes packaged by this factory, and place a bound on
the error of estimation. Assume that the total number of cartons packaged by the factory is
large enough for the finite population correction to be ignored.
13.
A newspaper wants to estimate the proportion of voters favoring a certain candidate,
“Candidate A,” in a state-wide election. Since it is very expensive to select and interview a
simple random sample of registered voters, cluster sampling is used with precincts as
clusters. A simple random sample of 50 precincts is selected from the 497 precincts in the
state. The newspaper wants to make the estimation on election day, but before final returns
are tallied. Therefore, reporters are sent to the polls of each sample precinct to obtain the
pertinent information directly from the voters. The results are tabulated below:
No. of    Number        No. of    Number        No. of    Number
voters    favoring A    voters    favoring A    voters    favoring A
 1290        680         1893       1143          843        321
 1170        631         1942       1187         1066        487
  840        475          971        542         1171        596
 1620        935         1143        973         1213        782
 1381        472         2041       1541         1741        980
 1492        820         2530       1679          983        693
 1785        933         1567        982         1865       1033
 2010       1171         1493        863         1888        987
  974        542         1271        742         1947        872
  832        457         1873       1010         2021       1093
 1247        983         2142       1092         2001       1461
 1896       1462         2380       1242         1493       1301
 1943        873         1693        973         1783       1167
  798        372         1661        652         1461        932
 1020        621         1555        523         1237        481
 1141        642         1492        831         1843        999
 1820        975         1957        932
Estimate the proportion of voters favoring Candidate A, and place a bound on the error of
estimation.
Answer: Proportion = 0.57; Error = 0.0307
14.
The same newspaper wants to conduct a similar survey during the next election. How large
a sample size will be needed to estimate the proportion of voters favoring a similar
candidate with a bound of 0.05 on the error of estimation? Use the data in Exercise 13.
Answer: n = 21
15.
A forester wishes to estimate the average height of trees on a plantation. The plantation is
divided into quarter-acre plots. A simple random sample of 20 plots is selected from the
386 plots on the plantation. All trees on the sampled plots are measured with the following
results:
Number of trees    Average height (feet)
      42                  6.2
      51                  5.8
      49                  6.7
      55                  4.9
      47                  5.2
      58                  6.9
      43                  4.3
      59                  5.2
      48                  5.7
      41                  6.1
      60                  6.3
      52                  6.7
      61                  5.9
      49                  6.1
      57                  6.0
      63                  4.9
      45                  5.3
      46                  6.7
      62                  6.1
      58                  7.0
Estimate the average height of trees on the plantation, and place a bound on the error of
estimation. (Hint: the total for cluster i can be found by taking the total number of elements
in cluster i times the cluster average).
16.
To emphasize safety, a taxi-cab company wants to estimate the proportion of unsafe tires on
their 175 cabs. (Ignore spare tires.) It is impractical to select a simple random sample of
tires, so cluster sampling is used with each cab as a cluster. A random sample of 25 cabs
gives the following number of unsafe tires per cab:
2, 4, 0, 1, 2, 0, 4, 1, 3, 1, 2, 0, 1,
1, 2, 2, 4, 1, 0, 0, 3, 1, 2, 2, 1.
Estimate the proportion of unsafe tires being used on the company’s cabs, and place a bound
on the error of estimation.
Answer: Proportion = 0.4; Error = 0.1165
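The answer to Exercise 16 can be checked with a short Python sketch of the proportion estimator. It assumes the ratio form p = (sum of a_i) / (sum of m_i), a variance of the same form as that of the cluster mean with a_i in place of y_i, and a bound of two standard errors. The function name cluster_proportion is hypothetical, and each cab is taken to carry 4 tires (spare tires ignored, as the exercise states).

```python
import math


def cluster_proportion(a, m, N):
    """Estimate a proportion and its error bound from a one-stage cluster sample.

    a: number of elements with the characteristic in each sampled cluster;
    m: size of each sampled cluster; N: clusters in the population.
    """
    n = len(a)
    p = sum(a) / sum(m)
    m_bar = sum(m) / n  # sample average cluster size, used in place of M/N
    s2 = sum((ai - p * mi) ** 2 for ai, mi in zip(a, m)) / (n - 1)
    var_p = (N - n) / (N * n * m_bar ** 2) * s2
    return p, 2 * math.sqrt(var_p)


unsafe = [2, 4, 0, 1, 2, 0, 4, 1, 3, 1, 2, 0, 1,
          1, 2, 2, 4, 1, 0, 0, 3, 1, 2, 2, 1]
tires = [4] * len(unsafe)  # 4 tires on every sampled cab

p_hat, bound = cluster_proportion(unsafe, tires, N=175)
print(round(p_hat, 4), round(bound, 4))
```

Because every cab has the same number of tires, the cluster sizes are all equal here and the sample average cluster size coincides with the population average.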
CHAPTER 11
CLUSTER SAMPLING VARIANCES
11.1
VARIANCE OF A TWO-STAGE CLUSTER SAMPLE
To study the variance of a two-stage cluster sample, it will be useful to review some ideas of
stratified sampling. In stratified sampling, the standard error of a sample estimate depends on the
within-stratum variances, $S_i^2$. For each stratum, the variance $S_i^2$ is defined by the same
formula as S² (the total variance of the population) but using only the elements in the ith stratum.
We saw that stratified sampling was most useful when the means of the strata were very
different. In fact the gains of stratified sampling can be determined by computing the standard
deviation among the means of the strata (that is, computing the standard deviation of the numbers
$\bar{Y}_i$, weighted by the number of units within each stratum) if the necessary data are
available. When the groups are clusters (primary sampling units or PSUs), the square of this
weighted standard deviation among the group means is called the between-PSU variance.
Similar concepts can be considered in cluster sampling. In fact, there is a close analogy between
cluster and stratified sampling. In both cases we group the individual elements into sets before
selecting the sample. The difference is that in stratified sampling it is necessary to sample within
every one of the sets (the strata); in cluster sampling a sample of the sets (the clusters) is selected
and then either all or a sample of the elements within the selected sets is included. The purpose
and method of forming the sets is very different in the two cases.
11.1.1
Notation
Consider a two-stage design in which second-stage sample units (SSUs) are selected randomly
from the elementary units within selected clusters (primary sampling units or PSUs) for
interview.
N = total number of PSUs (first-stage clusters) in the population.
n = number of selected sample PSUs.
$M = \sum_{i=1}^{N} M_i$ = total number of elementary units (second-stage units) in the population
$m = \sum_{i=1}^{n} m_i$ = total number of second-stage units (SSUs) in the sample
Mi = number of SSUs in the ith PSU (cluster), where i = 1,..., N
$\bar{M} = M/N$ = avg. number of SSUs per PSU in the population or avg. cluster size
mi = number of SSUs selected for the sample in the ith PSU, i = 1,..., n
$\bar{m} = m/n$ = average number of SSUs per sample PSU
$Y_{ij}$ = value of a characteristic for the jth elementary unit in the ith PSU in the population
$Y_i = \sum_{j=1}^{M_i} Y_{ij}$ = total value of the characteristic in the ith PSU in the population
$Y = \sum_{i=1}^{N} Y_i$ = total value of the characteristic in the population
$\bar{Y}_i = Y_i / M_i$ = average value of the population characteristic in the ith PSU
$\bar{Y} = Y/N$ = average value of the characteristic per PSU (cluster) in the population
$\bar{\bar{Y}} = Y/M$ = population mean per unit.
$y_{ij}$ = value of the characteristic for the jth sample SSU in the ith sample PSU
$y_i = \sum_{j=1}^{m_i} y_{ij}$ = total value of characteristic in the ith sample PSU
$\bar{y}_i = y_i / m_i$ = sample average of the characteristic in the ith sample PSU
$S_b^2 = \frac{1}{N-1}\sum_{i=1}^{N} (Y_i - \bar{Y})^2$ = population variance between PSU totals
$S_i^2 = \frac{1}{M_i-1}\sum_{j=1}^{M_i} (Y_{ij} - \bar{Y}_i)^2$ = within PSU variance in the ith PSU (for population).
11.1.2
Estimates of Means and Totals
The formulas given in previous chapters for estimating population means are appropriate when
the sampling unit is identical with the unit of analysis. An important characteristic of cluster
sampling, however, is that the sampling unit (at least in the first stage) is not the unit of analysis.
Thus, in the examples in the previous chapter, we would probably not be interested in the mean
per family, per school, per factory, or per block. Rather, we would be interested in estimating the
mean per family member, per school child, per factory worker, or per housing unit.
Consider a two-stage design in which the second stage units are the units of analysis; n clusters
are selected from among N clusters by simple random sampling; and mi units are selected in the
ith PSU using simple random sampling for i = 1, ... , n.
Within the ith cluster, the population mean per unit is given by
$$\bar{Y}_i = \frac{Y_i}{M_i} = \frac{1}{M_i}\sum_{j=1}^{M_i} Y_{ij} \qquad (11.1)$$
Since the units within the cluster were selected by simple random sampling, we know (from
chapter 4, section 2) that we can estimate this mean without bias, by using the following formula:
$$\bar{y}_i = \frac{1}{m_i}\sum_{j=1}^{m_i} y_{ij} \qquad (11.2)$$
These estimates of the cluster unit means from the n sample clusters must then be combined in
some way to estimate the overall population total (Y) and the population mean per unit given by
the following formula:
$$\bar{\bar{Y}} = \frac{Y}{M} = \frac{\sum_{i=1}^{N} Y_i}{\sum_{i=1}^{N} M_i}$$
Several estimators are available and are discussed in most standard texts: we shall examine only
one of these.
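First, we shall construct an estimator for the population total for the Y-characteristic. An
unbiased estimator for Yi, the ith PSU total, is given by
$$\hat{Y}_i = M_i \bar{y}_i = \frac{M_i}{m_i}\sum_{j=1}^{m_i} y_{ij} \qquad (11.3)$$
An unbiased estimator for the population total is then given by
$$\hat{Y} = \frac{N}{n}\sum_{i=1}^{n} \hat{Y}_i = \frac{N}{n}\sum_{i=1}^{n} M_i \bar{y}_i \qquad (11.4)$$
Similarly, we can estimate the total number of units of analysis in the population (assuming that
we do not know it) by
$$\hat{M} = \frac{N}{n}\sum_{i=1}^{n} M_i \qquad (11.5)$$
The population mean per unit is $\bar{\bar{Y}} = Y/M$. An estimator of $\bar{\bar{Y}}$ is
$$\hat{\bar{\bar{Y}}} = \frac{\hat{Y}}{\hat{M}} = \frac{\sum_{i=1}^{n} M_i \bar{y}_i}{\sum_{i=1}^{n} M_i} \qquad (11.6)$$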
As can be seen, this estimator is a weighted mean of the n sample cluster means per unit where
the weights are the corresponding cluster sizes. As indicated previously, this is only one of
several possible estimators; however, this estimator seems to be most generally useful. Since
both the numerator and denominator are random variables, this is a ratio-type estimator and it has
the usual bias of a ratio estimator. The bias will usually not be serious if the number of clusters
in the sample is reasonably large.
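The estimators of this section can be sketched in a few lines of code. The function name and the small data set below are hypothetical and serve only to show the arithmetic: each sample PSU mean is expanded to a PSU total, the PSU totals are expanded by N/n, and the mean per unit is taken as the ratio of the two estimated totals.

```python
def two_stage_estimates(N, sizes, samples):
    """Estimate the total, the number of units, and the mean per unit.

    N       -- number of PSUs in the population
    sizes   -- M_i for each sample PSU
    samples -- list of observed values y_ij for each sample PSU
    """
    n = len(sizes)
    # Estimated PSU totals: cluster size times the sample mean in the cluster.
    psu_totals = [M_i * sum(y) / len(y) for M_i, y in zip(sizes, samples)]
    y_hat = N / n * sum(psu_totals)   # estimated population total
    m_hat = N / n * sum(sizes)        # estimated number of units of analysis
    return y_hat, m_hat, y_hat / m_hat

# Hypothetical data: 2 PSUs sampled out of 10, with sizes 4 and 6.
y_hat, m_hat, mean_per_unit = two_stage_estimates(
    N=10, sizes=[4, 6], samples=[[1, 3], [2, 2, 5]]
)
print(y_hat, m_hat, mean_per_unit)
```

Because N/n cancels in the ratio, the mean per unit depends only on the sample cluster sizes and sample cluster means, which is why it can be computed even when N is unknown.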
11.1.3
Variances
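Consider the case when n PSUs are selected from a population of N PSUs and random samples
of mi (i = 1,...,n) SSUs are taken from the Mi (i=1,...,N) SSUs in the selected PSUs. Then the
variance of $\hat{Y}$, the estimator of Y, is
$$V(\hat{Y}) = N^2\left(1 - \frac{n}{N}\right)\frac{S_b^2}{n} + \frac{N}{n}\sum_{i=1}^{N} M_i^2\left(1 - \frac{m_i}{M_i}\right)\frac{S_i^2}{m_i} \qquad (11.7)$$
where,
$$S_b^2 = \frac{1}{N-1}\sum_{i=1}^{N} (Y_i - \bar{Y})^2 \qquad (11.8)$$
and
$$S_i^2 = \frac{1}{M_i-1}\sum_{j=1}^{M_i} (Y_{ij} - \bar{Y}_i)^2 \qquad (11.9)$$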
The variance of the estimator of Y is the sum of two components. The first component is the
contribution to the variance arising from the selection of first-stage units. The second component
is the contribution from the selection of second-stage units. If there are three or more stages of
sampling, the variance will include additional terms similar in form for each additional stage.
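The sample estimator of $V(\hat{Y})$ is
$$v(\hat{Y}) = N^2\left(1 - \frac{n}{N}\right)\frac{s_b^2}{n} + \frac{N}{n}\sum_{i=1}^{n} M_i^2\left(1 - \frac{m_i}{M_i}\right)\frac{s_i^2}{m_i} \qquad (11.10)$$
This estimator is unbiased for $V(\hat{Y})$ although $s_b^2$ is not unbiased for $S_b^2$. Now,
$$s_b^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(\hat{Y}_i - \frac{1}{n}\sum_{k=1}^{n}\hat{Y}_k\right)^2 \qquad (11.11)$$
and,
$$s_i^2 = \frac{1}{m_i-1}\sum_{j=1}^{m_i} (y_{ij} - \bar{y}_i)^2 \qquad (11.12)$$
The variance of $\hat{\bar{\bar{Y}}}$, the estimator of $\bar{\bar{Y}}$, is more complex.
Its approximate value may be obtained from equation (11.7) by replacing each PSU total $Y_i$
with $M_i(\bar{Y}_i - \bar{\bar{Y}})$ and dividing by $M^2 = N^2\bar{M}^2$:
$$V(\hat{\bar{\bar{Y}}}) \approx \frac{1}{\bar{M}^2}\left[\left(1 - \frac{n}{N}\right)\frac{S_r^2}{n} + \frac{1}{nN}\sum_{i=1}^{N} M_i^2\left(1 - \frac{m_i}{M_i}\right)\frac{S_i^2}{m_i}\right] \qquad (11.13)$$
where
$$S_r^2 = \frac{1}{N-1}\sum_{i=1}^{N} M_i^2 (\bar{Y}_i - \bar{\bar{Y}})^2 \qquad (11.14)$$
and is estimated using
$$v(\hat{\bar{\bar{Y}}}) = \frac{1}{\hat{\bar{M}}^2}\left[\left(1 - \frac{n}{N}\right)\frac{1}{n(n-1)}\sum_{i=1}^{n} M_i^2 (\bar{y}_i - \hat{\bar{\bar{Y}}})^2 + \frac{1}{nN}\sum_{i=1}^{n} M_i^2\left(1 - \frac{m_i}{M_i}\right)\frac{s_i^2}{m_i}\right] \qquad (11.15)$$
where $\hat{\bar{M}} = \frac{1}{n}\sum_{i=1}^{n} M_i$ is the average size of the sample PSUs.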
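If all PSUs have the same number $\bar{M}$ of second-stage units and a constant number $\bar{m}$ of them is
sampled from every sample PSU, we have
$$\hat{Y} = \frac{N\bar{M}}{n\bar{m}}\sum_{i=1}^{n}\sum_{j=1}^{\bar{m}} y_{ij} \quad \text{and} \quad \hat{\bar{\bar{Y}}} = \frac{1}{n\bar{m}}\sum_{i=1}^{n}\sum_{j=1}^{\bar{m}} y_{ij} \qquad (11.16)$$
In this case, the variance of $\hat{Y}$ is
$$V(\hat{Y}) = N^2 (1 - f_1)\frac{S_b^2}{n} + \frac{N^2\bar{M}^2}{n\bar{m}}(1 - f_2) S_w^2 \qquad (11.17)$$
where $f_1 = n/N$, $f_2 = \bar{m}/\bar{M}$, $S_b^2$ is the variance between PSU totals defined in (11.8),
and
$$S_w^2 = \frac{1}{N}\sum_{i=1}^{N} S_i^2 \qquad (11.18)$$
The sample estimate of $V(\hat{Y})$ is given by:
$$v(\hat{Y}) = N^2 (1 - f_1)\frac{s_b^2}{n} + \frac{N\bar{M}^2}{\bar{m}}(1 - f_2) s_w^2 \qquad (11.19)$$
where,
$$s_w^2 = \frac{1}{n}\sum_{i=1}^{n} s_i^2 \qquad (11.20)$$
The variance of an estimated mean is
$$V(\hat{\bar{\bar{Y}}}) = \frac{V(\hat{Y})}{N^2\bar{M}^2} = (1 - f_1)\frac{S_b^2}{n\bar{M}^2} + (1 - f_2)\frac{S_w^2}{n\bar{m}} \qquad (11.21)$$
and is estimated using
$$v(\hat{\bar{\bar{Y}}}) = \frac{v(\hat{Y})}{N^2\bar{M}^2} = (1 - f_1)\frac{s_b^2}{n\bar{M}^2} + (1 - f_2)\frac{s_w^2}{N\bar{m}} \qquad (11.22)$$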
11.1.3.1
Illustration
A population consists of four clusters of five households each. The second-stage units, which are
also the elementary units in this case, are households with the following numbers of persons:

                    Cluster
Household      1     2     3     4
    1          3     8     4     7
    2         10     3     6     2
    3          9     6     3     6
    4          8     4     8     4
    5          6     5     6     6

First, select two clusters at random from a population of four clusters. Then within each of these
selected clusters take a random sample of three households. Compute $\hat{Y}$, the estimate of the
population total Y, and the variance of $\hat{Y}$. Find the variance and the coefficient of
variation of $\hat{\bar{\bar{Y}}}$, the estimate of $\bar{\bar{Y}}$.
In this case, N = 4, n = 2, $M_i = \bar{M} = 5$, and $m_i = \bar{m} = 3$.
Suppose that clusters 3 and 4 are selected at random. Assume also that households 1, 2, and 5 within
cluster 4 and households 2, 4, and 5 within cluster 3 are selected at random. Then we have,
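$$\bar{y}_3 = \frac{6 + 8 + 6}{3} = 6.667, \qquad \bar{y}_4 = \frac{7 + 2 + 6}{3} = 5.000$$
$$\hat{Y} = \frac{N}{n}\,\bar{M}\,(\bar{y}_3 + \bar{y}_4) = \frac{4}{2}(5)(11.667) = 116.67$$
Using equation (11.19),
$$v(\hat{Y}) = N^2 (1 - f_1)\frac{s_b^2}{n} + \frac{N\bar{M}^2}{\bar{m}}(1 - f_2) s_w^2$$
where $f_1 = n/N = 0.5$, and $f_2 = \bar{m}/\bar{M} = 0.6$.
Here $s_b^2$ is the sample variance of the estimated PSU totals, $\hat{Y}_3 = 5(6.667) = 33.33$ and
$\hat{Y}_4 = 5(5.000) = 25.00$, and $s_w^2$ is the average of the within-PSU sample variances. We have,
$$s_b^2 = (33.33 - 29.17)^2 + (25.00 - 29.17)^2 = 34.72$$
and
$$s_w^2 = \frac{1}{2}(s_3^2 + s_4^2) = \frac{1}{2}(1.333 + 7.000) = 4.167$$
On substitution, we have,
$$v(\hat{Y}) = \frac{16(0.5)(34.72)}{2} + \frac{4(25)(0.4)(4.167)}{3} = 138.89 + 55.56 = 194.44$$
The average number of persons per household is estimated by:
$$\hat{\bar{\bar{Y}}} = \frac{\hat{Y}}{N\bar{M}} = \frac{116.67}{20} = 5.83$$
The estimated variance of this estimate is:
$$v(\hat{\bar{\bar{Y}}}) = \frac{v(\hat{Y})}{(N\bar{M})^2} = \frac{194.44}{400} = 0.486$$
The standard error of $\hat{\bar{\bar{Y}}}$ is:
$$\sqrt{0.486} = 0.697$$
and the coefficient of variation of $\hat{\bar{\bar{Y}}}$ is:
$$\frac{0.697}{5.83} = 0.12, \text{ or about 12 percent.}$$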
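The arithmetic of this illustration can be checked with the sketch below. It assumes the standard equal-cluster-size form of the variance estimator, with s_b2 taken as the sample variance of the estimated PSU totals and s_w2 as the average within-PSU sample variance. The variable names are illustrative only.

```python
import math
from statistics import mean, variance

N, n = 4, 2            # clusters in the population and in the sample
M_bar, m_bar = 5, 3    # households per cluster and households sampled per cluster

# Persons per sampled household: cluster 3 (households 2, 4, 5)
# and cluster 4 (households 1, 2, 5).
sample = {3: [6, 8, 6], 4: [7, 2, 6]}

f1, f2 = n / N, m_bar / M_bar
psu_totals = [M_bar * mean(y) for y in sample.values()]   # estimated PSU totals
y_hat = N / n * sum(psu_totals)                           # estimated population total

s_b2 = variance(psu_totals)                               # between-PSU component
s_w2 = mean(variance(y) for y in sample.values())         # average within-PSU variance

v_total = N**2 * (1 - f1) * s_b2 / n + N * M_bar**2 * (1 - f2) * s_w2 / m_bar
y_bar = y_hat / (N * M_bar)                 # estimated persons per household
v_mean = v_total / (N * M_bar) ** 2
cv = math.sqrt(v_mean) / y_bar

print(round(y_hat, 2), round(v_total, 2), round(y_bar, 2), round(cv, 3))
```

The population total is 114 persons in this example, so this particular sample overestimates it slightly.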
11.1.4
Random Group Method of Approximating Variances
The above formulas are somewhat cumbersome. Consequently, short-cut approximations are often
used to reduce the amount of work, particularly if variance estimates are to be computed for a large
number of characteristics. One of these approximations is known as the random group method.
The random group method consists of dividing the sample into a number of groups at random; each
group is then used to make an estimate of the total, mean, etc. (this would be done for each
characteristic for which a variance is to be computed). Each of the random groups will reflect the
various steps of the sample selection so that the estimate from each group is an estimate of the total
with the same sample design as the whole sample (but with a much smaller sample size). In a multistage sample, the random groups are usually formed by placing the entire sample from a primary
sampling unit in a single group. For complex designs using stratification and/or sampling over time,
somewhat different methods are available to divide the sample into random groups. However, the
method is not very useful if the number of first-stage units is small.
In computing the estimates of variance, it is exactly the variance between different possible estimates
of the total or mean in which we are interested. Therefore, this method which provides a number of
different estimates of the total or mean, each with some degree of stability (that is, the number of
cases in a group should not be too small) is a realistic one for estimating variances.
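The following is a minimal sketch of the random group idea for a simple random sample of PSUs. It assumes equal-probability selection, ignores the finite population correction, and places each sample PSU whole into one of k groups. The function name and the data are hypothetical.

```python
import random
from statistics import mean, variance

def random_group_variance(psu_totals, N, k, seed=0):
    """Approximate the variance of an estimated total by random groups.

    psu_totals -- estimated totals for the n sample PSUs
    N          -- number of PSUs in the population
    k          -- number of random groups (at least 2, at most n)
    """
    rng = random.Random(seed)
    shuffled = list(psu_totals)
    rng.shuffle(shuffled)
    # Deal the shuffled PSUs into k groups; each PSU stays whole.
    groups = [shuffled[g::k] for g in range(k)]
    # Each group gives its own estimate of the population total.
    group_estimates = [N / len(g) * sum(g) for g in groups]
    overall = mean(group_estimates)
    # Variance of the mean of k roughly independent group estimates.
    return overall, variance(group_estimates) / k

# Hypothetical estimated totals for 8 sample PSUs out of N = 40.
totals = [33, 25, 41, 30, 28, 36, 39, 27]
overall, v_rg = random_group_variance(totals, N=40, k=4)
print(overall, v_rg)
```

With only a few groups the variance estimate is itself quite unstable, which is consistent with the remark that the method is not very useful when the number of first-stage units is small.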
11.2
LIMITING FORMS OF VARIANCE OF TWO-STAGE SAMPLE
Examining the variance equation (11.7) and equation (11.13), we can easily see what happens in two
simple situations. First, if all second-stage units are included in the sample we have the case
described in chapter 9 as "single-stage cluster sampling." In this case, mi = Mi and the term arising
from variation within first-stage units is zero. In equation (11.7), the first term is the same as the
variance formula for simple random sampling except that the sample sizes and values of Yi refer to
the first-stage units. For example, if area segments were the first-stage units, N is the total number of
area segments and Yi is the segment total for the variable. In equation (11.13), the first term is the
between cluster component of the overall variance which is based on the differences among cluster
means per unit of analysis rather than on differences among cluster totals.
Secondly, consider a situation in which all first-stage units are in the sample. In this case, n = N and
the first term becomes zero. The variance of the estimator of the population total becomes equal to
$$V(\hat{Y}) = \sum_{i=1}^{N} M_i^2\left(1 - \frac{m_i}{M_i}\right)\frac{S_i^2}{m_i}$$
The variance of the estimator of the population mean per element is then equal to
$$V(\hat{\bar{\bar{Y}}}) = \frac{1}{M^2}\sum_{i=1}^{N} M_i^2\left(1 - \frac{m_i}{M_i}\right)\frac{S_i^2}{m_i}$$
These are the variance formulas for the estimators of totals and means from a stratified sample. In
other words, a stratified sample is simply a special case of a cluster sample in which all first-stage
units are included in the sample and a subsample of second-stage units is selected from each
first-stage unit.
This discussion has covered only the case of simple random sampling for both the first-stage and
second-stage selections. Analogous formulas can be developed for stratified cluster sampling in
which the only difference is that the terms in the equations are replaced by the sums of similar terms
over strata.
11.3
ANALYSIS OF COMPONENTS OF VARIANCE
A more detailed analysis of equation (11.7) and equation (11.13) would show that for a two-stage
sample containing a given total number of units of analysis, the sampling variances of estimates
computed by equation (11.4) and equation (11.6) depend on several factors. Two important factors
which the sampling statistician must consider in designing the sample are:
(1)
The variability in size of first-stage units in terms of the number of second-stage units they
contain.
(2)
The variability among second-stage units (the elementary units or units of analysis) within
first-stage units.
11.3.1
Variability in Size of First-Stage Units
If the first-stage units are unequal in size in terms of the number of second-stage units (for example,
the number of holdings in an area segment), these variations in size can have a profound effect on the
size of the variance of the estimator of the population total, as shown by the first term in equation
(11.7). We can see in equation (11.13) that the variance of the estimator of the population mean per
elementary unit is affected by the variation among first-stage means per element. If the variability in
size is very great, it will be necessary to use a large sample of first-stage units or to change the
sampling and estimating methods to keep the standard error within reasonable bounds (see section
11.4 below).
11.3.2
Variability Among Second-Stage Units
The second important factor is the variability among second-stage units (units of analysis) within
first-stage units (clusters). For a given sampling plan in which we select n out of N clusters and an
average of $\bar{m}$ units of analysis out of each sample cluster, it can be shown that the greater the
variability among second-stage units within first-stage units, the smaller will be the sampling
variability of resulting estimates. In other words, it is desirable that the units of analysis have a
relatively low intraclass correlation. Intraclass correlation is a measure of similarity among units
within a cluster with regard to the characteristics being investigated.
A mathematical demonstration of this phenomenon is beyond the scope of this chapter; however, by
considering an extreme example we can gain an intuitive understanding of it. Consider a situation in
which the units of analysis within each cluster are identical. Clearly, a sampling plan such as
described above would not be efficient. A single unit of analysis within a given cluster would
provide complete information about all the units; consequently, the remaining $\bar{m} - 1$ units would
contribute nothing additional to our knowledge. To include them in the sample would be a waste of
resources. The inefficiency of this design in this situation would be reflected in a high sampling
variability relative to a simple random sample with the same number of units of analysis.
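The size of this effect is often summarized by a design effect. The sketch below uses the common textbook approximation deff = 1 + (m - 1) * rho for clusters of equal size, where m is the number of units taken per cluster and rho is the intraclass correlation. This approximation is not derived in this chapter, and the function name is illustrative.

```python
def design_effect(m_bar, rho):
    """Approximate ratio of the cluster-sample variance to the variance of a
    simple random sample with the same number of units of analysis."""
    return 1 + (m_bar - 1) * rho

# Ten units per cluster under increasing intraclass correlation.
for rho in (0.0, 0.05, 0.2, 1.0):
    print(rho, design_effect(10, rho))
```

With rho = 1 (identical units within each cluster, the extreme case described above) the design effect equals the number of units per cluster, so ten units per cluster carry no more information than one.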
The statistician must consider the effect of intraclass correlation on the sampling variability when
designing a sample. This is particularly true of area sampling since units which are close together
geographically are usually quite similar for many characteristics such as income, education, attitudes,
type of agricultural activity, etc. The usual approach is to limit the number of units of analysis taken
from the first-stage units and include more of the first-stage units in the sample. In a single-stage
sample, the statistician can do this by making the clusters as small as practicable. The more common
approach, however, is to introduce additional stages in the sampling procedure so that the number of
units of analysis ultimately selected from each unit at the last stage is small. The statistician must, of
course, balance precision against cost in deciding on a sampling plan.
Notice that in cluster sampling we gain by having units within clusters as unlike as possible, but in
stratified sampling we gain by having units within strata as much alike as possible. The reason for
this difference becomes clear when you recall from section 11.2 above that in stratified sampling, the
"between-cluster" component of the variance drops out of the equation entirely.
11.4
CONTROL OF VARIABILITY IN SIZE OF CLUSTER
In all of this discussion, it has been assumed that the only way we could affect the sampling variance,
with the given population, is to take more or fewer sample cases in the first or second stages or to
vary the size of the first-stage units. Of course, if the sampling variance can be reduced by
appropriate stratification, this should be done first. Several special procedures are also available to
control the effect of variability in size of cluster. The most important procedure is described below.
Although this discussion is related to a two-stage sample, a similar analysis could be made for three
or more stages. The procedures described below for controlling variability in size apply equally well
to first, second, or other stages, whenever cluster sampling is used.
11.4.1
Define Clusters of Equal Size
One obvious method is to attempt to define clusters in such a way that they are approximately equal
in size in terms of the number of units of analysis with the expectation that this will tend to make
them equal also in terms of characteristics being investigated. If this can be done with available
materials and information, then no other action is necessary. For example, if block counts of
numbers of housing units are available for cities and villages, it may be possible to group small
blocks together to make clusters which contain approximately the same number of housing units.
In some cases, it may be feasible to define clusters directly in terms of a characteristic being
investigated. For example, in an agricultural survey, clusters can be constructed to be nearly equal in
area. If recent aerial photographs are available, they might even be made nearly equal in terms of
cultivated area.
11.4.2
Stratify Clusters by Size
If information is available on the size of all the first-stage clusters in the universe in advance of the
survey (reasonably good approximations are adequate), it is possible to stratify the clusters by size
group. The effect of stratification is to replace a total variance by a sum of within-stratum variances.
Within each stratum, the clusters should be about equal in size; therefore, stratification by size of
cluster will have about the same effect as making all clusters in the whole population about equal in
size.
If information on size is not available, it may be worthwhile to spend a small amount of the available
resources, for example, in making a "Quick Count" of city blocks in order to obtain approximate
sizes of the first-stage units (in terms of the number of housing units they contain). Errors in counts
do not cause biases in the estimates, which are based on the actual numbers of housing units found in
the survey itself.
Either optimum or proportionate sampling can be performed depending on which appears most
useful in the particular case. If more than one characteristic is being estimated, proportionate
sampling may be preferable to optimum allocation, since the optimum allocation might be different
for each characteristic. Also, proportionate sampling is usually safer unless very good measures of
size are available, since the use of the optimum allocation formula with poor measures of size may
actually increase the variance.
11.4.3
Use of Ratio Estimates
A third method of reducing the effect of variability in cluster size is through the use of ratio
estimates. Ratio estimates were discussed in detail in Chapter 9; an example of the method is given
here. A ratio estimate makes use of a quantity of the form $\hat{Y}/\hat{X}$ where both $\hat{Y}$ and $\hat{X}$ are estimates
of totals made from sample data. X, the universe total of the quantity of which $\hat{X}$ is an estimate,
must be known (it may be a projection or other figure which is believed to be very close to the true
value). One can make a ratio estimate of the universe total Y, an estimate which is frequently very
efficient, by using
$$\hat{Y}_R = \frac{\hat{Y}}{\hat{X}}\,X$$
instead of $\hat{Y}$ alone. The new estimate of Y thus differs significantly from $\hat{Y}$ since it involves two
items having sampling variances instead of one. Ratio estimates are generally much less sensitive to
variation in size of cluster than estimates of the type
$$\hat{Y} = \frac{N}{n}\sum_{i=1}^{n} Y_i$$
and their use will frequently reduce the standard errors appreciably.
11.4.3.1 Ratio to Approximate Number of Units of Analysis
Two different uses of ratio estimates for this purpose will be discussed. In the first, "X" is a variable
closely related to the total number of units of analysis in the clusters and $\hat{X}$ is an estimate, based on
the sample clusters only, of the population aggregate, X. For example, consider a sample design in
which city blocks are the first-stage units, and housing units are both the second-stage units and the
units of analysis. We have rough counts (Xi) of housing units for each block based on a previous
census or special counts made for this purpose. These counts can be totaled for all blocks in the city
to obtain X. Then $\hat{X}$, a sample estimate of X, can be obtained by adding up the rough counts for the
sample blocks only, and multiplying this by N/n (where N is the total number of blocks in the city
and n is the number in the sample). Then, a ratio estimate of Y is
$$\hat{Y}_R = \frac{\hat{Y}}{\hat{X}}\,X = \frac{\sum_{i=1}^{n} Y_i}{\sum_{i=1}^{n} X_i}\,X$$
If subsampling is used within the first-stage units, the procedure would be modified. In order to
make the fullest gain with this type of ratio estimate, it is advisable not to subsample independently
within the clusters, but to treat the second-stage units within the clusters as a continuous list and
sample systematically throughout.
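For the case without subsampling, the ratio estimate described in this section can be sketched as follows. The function name and the block data are hypothetical. Note that the expansion factor N/n cancels in the ratio, so only the sample sums and the known total X are needed.

```python
def ratio_estimate_total(N, y_totals, x_counts, X):
    """Ratio estimate of the population total Y.

    N        -- number of blocks in the city
    y_totals -- survey totals Y_i for the sample blocks
    x_counts -- rough prior counts X_i for the same sample blocks
    X        -- sum of the rough counts over all N blocks (known)
    """
    n = len(y_totals)
    y_hat = N / n * sum(y_totals)   # simple expansion estimate of Y
    x_hat = N / n * sum(x_counts)   # same expansion applied to the rough counts
    return y_hat / x_hat * X

# Hypothetical figures: 4 sample blocks out of 100, rough counts totaling 3,200.
estimate = ratio_estimate_total(
    N=100, y_totals=[60, 110, 140, 50], x_counts=[20, 35, 50, 15], X=3200
)
print(estimate)
```

In this made-up example the simple expansion estimate would be 9,000, while the ratio estimate is 9,600 because the sample blocks happen to be smaller, on average, than the blocks in the city as a whole.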
11.4.3.2 Ratio to a Correlated Statistic
In a second use of ratio estimates the true value of some universe total X is known and a sample
estimate $\hat{X}$ (of X) can be obtained in the survey. If the characteristics "Y" and "X" are positively
correlated, then using $(\hat{Y}/\hat{X})\,X$ will reduce the effect of variability in cluster size (and possibly other types of
variability as well). For example, suppose a survey is planned to measure the total wage and salary
earnings of factory workers (Y). We can do this by taking a sample of factories (the clusters) and
including all workers within the sample of factories. Suppose the total sales of all factories can be
found (X) from some other source, for example tax records. We could then include on our
questionnaire to the sample factories a question on total sales (xi) as well as wage and salary
payments (yi), and we could prepare estimates of population totals for both characteristics from the
sample in the usual manner. The ratio estimate of wages and salaries would then be
$$\hat{Y}_R = \frac{\hat{Y}}{\hat{X}}\,X$$
11.4.4 Use of Probability Proportionate to Size
A fourth method for controlling the effects of variability in cluster size is to select the sample
clusters with probability proportionate to size instead of using a simple random sample of clusters.
Probability proportionate to size is frequently abbreviated as PPS. Selection with PPS means that a
cluster which is, for example, 5 times as large as another, will have 5 times the chance to be in the
sample. It might appear, at first, that this would introduce a bias in the sample result, with some
clusters over represented and others under represented. The bias is avoided by weighting each
sample cluster by the reciprocal of its probability of selection. When PPS is used, the unbiased
estimate of the total, where there is no subsampling, is
$$\hat{Y} = \sum_{i=1}^{n} \frac{Y_i}{P_i}$$
Here Yi is the total in the ith cluster in the sample and Pi is the probability of selection of this cluster.
It can easily be shown that this provides an unbiased estimate of Y.
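The unbiasedness can be verified directly in the simplest case, a sample of a single cluster drawn with probability proportionate to size. The sketch below uses hypothetical sizes and totals and exact fractions, and computes the expected value of Yi/Pi over all possible selections.

```python
from fractions import Fraction

# Hypothetical population of four clusters.
sizes = [50, 12, 20, 31]     # measures of size
totals = [140, 30, 70, 95]   # cluster totals Y_i

M = sum(sizes)
probs = [Fraction(s, M) for s in sizes]   # P_i for a sample of one cluster

# Expected value of the estimator Y_i / P_i: each possible outcome
# weighted by the probability that it occurs.
expected = sum(p * (Fraction(y) / p) for p, y in zip(probs, totals))

print(expected, sum(totals))
```

Each term reduces to Yi, so the expected value equals the population total whatever measures of size are used. The measures of size affect the variance of the estimate, not its expected value.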
11.4.4.1
Two-stage Sampling
A common application of sampling with PPS is the use of PPS for the selection of the first-stage
units in a two-stage sample. When this is done, the subsampling rates are usually set as inversely
proportional to size. As a result, the chance of any second-stage unit being included in the sample is
the product of the probability of the first-stage and second-stage selections. All second-stage units
therefore have identical probabilities and the sample is self-weighting.
There are a number of other advantages to this type of selection procedure; for example, the
workload can be made approximately the same for all selected first-stage units. Moreover, the
estimates will usually have smaller variances than those from a proportionate sample in which the
first-stage units are selected with equal probabilities.
11.4.4.2
Measures of Size
In order to select with PPS, it is necessary to have measures of size of each cluster in the population.
If measures of size are not available, it will usually be found worth the effort to prepare crude
estimates of size. (Rough approximations will be almost as effective as more exact measures.) Let
us assume such measures are available. The mechanics for selecting a sample with PPS can best be
described through an illustration.
11.4.4.3
Illustration
Suppose the clusters are blocks and we wish to sample the housing units in a universe made up of the
10 blocks as listed in column (1) of Table 11.1. We would list, in column (2), the measure of size
for each block (this may be a rough estimate of the number of housing units), and cumulate the
measures of size in column (3). The last figure in column (3) is the total number (rough estimate) of
housing units in all 10 blocks. Let us assume that we wish to include in the sample 5 blocks out of
the 10, and that the sample is to include 10 percent of all the housing units.
Table 11.1 SELECTION OF SAMPLE BLOCKS

Block     Measure    Cumulative    Sample         Probability of          Within-cluster
number    of size    measure       designation    selection               sampling rate
(PSU)                                             (Pi) = nh*(Mhi/Mh)      mhi/Mhi
(1)       (2)        (3)           (4)            (5)                     (6)
  1        50          0 - 50       22.5          50 ÷ 60.2               60.2 ÷ 500
  2        12         51 - 62
  3        20         63 - 82
  4        31         83 - 113      82.7          31 ÷ 60.2               60.2 ÷ 310
  5        10        114 - 123
  6        60        124 - 183     142.9          60 ÷ 60.2               60.2 ÷ 600
  7        55        184 - 238     203.1          55 ÷ 60.2               60.2 ÷ 550
  8        13        239 - 251
  9        30        252 - 281     263.3          30 ÷ 60.2               60.2 ÷ 300
 10        20        282 - 301

After completing the first three columns of Table 11.1 as shown, proceed as follows:
(1)
Since there are 5 blocks in the sample, divide the final cumulative measure (301) by 5; this
gives 60.2, which is the "sampling interval" for selecting blocks.
(2)
Choose a random number between 0.1 and 60.2; suppose the number happens to be 22.5. This
number is called the Random Start (RS).
(3)
Use this random number as the starting number and enter it in column (4), on the line whose
cumulative measure interval includes the number 22.5. In our example, the cumulative
measure interval is [0 - 50].
(4)
Add the sampling interval (60.2) to the random start (22.5). The result is 82.7; enter 82.7 on
the line whose cumulative measure interval contains this number. Because 82.7 is greater than
82, the last cumulative measure for block 3, it falls in block 4, whose interval is [83 - 113].
Continue adding 60.2 to the last number obtained (82.7 in our case) and obtain the next one:
142.9. Locate the interval which contains 142.9. In our case the interval is [124 - 183].
Continue with this procedure until a number is reached which is larger than the last
cumulative measure.
(5)
The blocks with entries in column (4) are the ones in the sample. In this example, they are
blocks 1, 4, 6, 7, and 9.
(6)
The probability (Pi) of selection of each block actually selected is entered in column (5). For
each block, the probability is the measure of size in column (2) divided by the sampling
interval 60.2.
(7)
The sampling rate to be used within each selected block is computed and entered in column (6).
For each block, the rate is the desired overall probability of selection, namely 1/10, divided by
the entry in column (5). Thus, for block 1, the rate in column (6) would be
(1/10) ÷ (50/60.2), or 60.2 ÷ 500, which is about 0.12.
(8)
It occasionally happens that some of the blocks are so large that the measures of size are greater
than the sampling interval. As a result, there may be two or more entries in column (4) for the
same block. In such a case, the subsampling rate within the block is adjusted to make the
overall probability for the selection of housing units equal to 1/10, in our example.
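The steps above can be sketched as a short program. The function name is hypothetical. A selection number is assigned to the first block whose cumulative measure is at least as large as that number, which matches the rule used for 82.7 in step (4).

```python
from bisect import bisect_left
from itertools import accumulate

def pps_systematic(sizes, n, random_start, overall_rate):
    """Systematic PPS selection of n blocks from a list of measures of size.

    Returns the sampling interval and, for each selection, a tuple of
    (block number, selection number, P_i, within-block sampling rate).
    """
    cumulative = list(accumulate(sizes))   # column (3)
    interval = cumulative[-1] / n          # step (1)
    rows = []
    for k in range(n):
        point = random_start + k * interval   # steps (2) to (4)
        idx = bisect_left(cumulative, point)  # first block reaching this point
        p_i = sizes[idx] / interval           # column (5)
        rows.append((idx + 1, round(point, 1), p_i, overall_rate / p_i))  # column (6)
    return interval, rows

sizes = [50, 12, 20, 31, 10, 60, 55, 13, 30, 20]   # column (2) of Table 11.1
interval, rows = pps_systematic(sizes, n=5, random_start=22.5, overall_rate=0.10)
for block, point, p_i, rate in rows:
    print(block, point, round(p_i, 3), round(rate, 4))
```

With the random start of 22.5 this selects blocks 1, 4, 6, 7, and 9, as in the table. In every selected block the product of columns (5) and (6) is 1/10, which is what makes the sample self-weighting. A block whose measure of size exceeds the interval would appear more than once in the output and needs the adjustment described in step (8).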
Two-Stage Cluster Sampling
1.
A nurseryman wants to estimate the average height of seedlings in a large field that is divided
into 50 plots that vary slightly in size. He believes the heights are fairly constant throughout
each plot, but may vary considerably from plot to plot. Therefore, it is decided to sample 10%
of the trees within each of 10 plots using a two-stage cluster sample. The data are as follows:
          Number of     Number of
          seedlings     seedlings     Height of seedlings
Plot      planted       sampled       (inches)
  1          52            5          12, 11, 12, 10, 13
  2          56            6          10, 9, 7, 9, 8, 10
  3          60            6          6, 5, 7, 5, 6, 4
  4          46            5          7, 8, 7, 7, 6
  5          49            5          10, 11, 13, 12, 12
  6          51            5          14, 15, 13, 12, 13
  7          50            5          6, 7, 6, 8, 7
  8          61            6          9, 10, 8, 9, 9, 10
  9          60            6          7, 10, 8, 9, 9, 10
 10          45            6          12, 11, 12, 13, 12, 12
Estimate the average height of seedlings in the field, and place a bound on the error of
estimation.
2.
In Exercise 1, assume that the nurseryman knows there are approximately 2600 seedlings in the
field. Use this additional information to estimate the average height, and place a bound on the
error of estimation.
3.
A supermarket chain has stores in 32 cities. A company official wants to estimate the
proportion of stores in the chain which do not meet a specified cleanliness criterion. Stores
within each city appear to possess similar characteristics; therefore, it is decided to select a
two-stage cluster sample containing one-half of the stores within each of four cities. Cluster
sampling is desirable in this situation because of travel costs. The data collected are as follows:
          Number of        Number of          Number of stores not
City      stores in city   stores sampled     meeting criterion
  1            25               13                    3
  2            10                5                    1
  3            18                9                    4
  4            16                8                    2
Estimate the proportion of stores not meeting the cleanliness criterion, and place a bound on
the error of estimation.
4.
Repeat Exercise 3 given that the chain contains 450 stores.
5.
To improve telephone service, an executive of a certain company wants to estimate the total
number of phone calls placed by secretaries in the company during one day. The company
contains 12 departments, each making approximately the same number of calls per day. Each
department employs approximately 20 secretaries, and the number of calls made varies
considerably from secretary to secretary. It is decided to employ two-stage cluster sampling
using a small number of departments (clusters) and selecting a fairly large number of secretaries
(elements) from each. Ten secretaries are sampled from each of four departments. The data are
summarized in the following table:
              Number of      Number of secretaries
Department    Secretaries    sampled                  Mean     Variance
    1             21               10                 15.5       2.8
    2             23               10                 15.8       3.1
    3             20               10                 17.0       3.5
    4             20               10                 14.9       3.4
Estimate the total number of calls placed by the secretaries in this company, and place a bound
on the error of estimation.
6.
A city zoning commission wants to estimate the proportion of property owners in a certain
section of a city who favor a proposed zoning change. The section is divided into 7 distinct
residential areas, each containing similar residents. Because the results must be obtained in a
short period of time, two-stage cluster sampling is used. Three of the 7 areas are selected at
random and 20% of the property owners in each area selected are sampled. The figure of 20%
seems reasonable because the people living within each area seem to be in the same
socioeconomic class and, hence, they tend to hold similar opinions on the zoning question.
The results are as follows:
          Number of property    Number of property    Number in favor of
Area      owners                owners sampled        zoning change
  1            46                     9                      1
  2            67                    13                      2
  3            93                    20                      2
Estimate the proportion of property owners who favor the proposed zoning change, and place a
bound on the error of estimation.
7.
A forester wants to estimate the total number of trees in a certain county which are infected
with a particular disease. There are ten well-defined forest areas in the county; these areas can
be subdivided into plots of approximately the same size. Four crews are available to conduct
the survey, which must be completed in one day. Hence, two-stage cluster sampling is used.
Four areas (clusters) are chosen with 6 plots (elements) randomly selected from each. (Each
crew can survey six plots in one day). The data are as follows:
                            Number of plots    Number of infected
Area     Number of plots    sampled            trees per plot
  1            12                 6            15, 14, 21, 13, 9, 10
  2            15                 6            4, 6, 10, 9, 8, 5
  3            14                 6            10, 11, 14, 10, 9, 15
  4            21                 6            8, 3, 4, 1, 2, 5
Estimate the total number of infected trees in the county, and place a bound on the error of
estimation.
8.
A new bottling machine is being tested by a company. During a test run, the machine fills 24
cases, each containing a dozen bottles. It is desired to estimate the average number of ounces
of fill per bottle. A two-stage cluster sample is employed using 6 cases (clusters) with 4
bottles (elements) randomly selected from each. The results are as follows:
Case    Average ounces of fill for sample    Sample variance
1       7.9                                  0.15
2       8.0                                  0.12
3       7.8                                  0.09
4       7.9                                  0.11
5       8.1                                  0.10
6       7.9                                  0.12
Estimate the average number of ounces per bottle, and place a bound on the error of estimation.
9.
A population consists of four clusters. The second-stage units, which are also the elementary
units in this case, are houses having rental values as follows:
          Cluster 1    Cluster 2    Cluster 3    Cluster 4
          $100         $100         $10          $50
          100          100          20           90
          200                       40
          400                       50
TOTALS    800          200          120          140
a.
What is the value of the between-cluster variance?
b.
What is the value of the within-cluster variance for the first cluster?
c.
A sample of two clusters is selected with equal probability; within each selected
cluster, half the elementary units are in the sample.
(i)
How would you compute the estimate of Y?
(ii)
What is the variance of the sample estimate of the total?
(iii)
What is the probability that any elementary unit will be in the sample?
(iv)
Compute the coefficient of variation for the estimate.
10.
The following table shows areas of cacao holdings of 15 farmers in five clusters (PSUs) of
equal size. The five clusters were selected at random from a total of 40 clusters into which the
territory had been divided. Each PSU represents a geographic division containing 120 cacao
farmers.
Area of Cacao Holdings
          Cluster 1    Cluster 2    Cluster 3    Cluster 4    Cluster 5
          96           110          102          140          132
          134          121          113          142          162
          152          146          157          161          184
TOTALS    382          377          372          443          478
a.
Estimate the total area of cacao for the territory.
b.
Estimate the average area of cacao per farm.
c.
Compute the standard errors for the estimates given in exercises (a) and (b).
d.
Compute the coefficient of variation for the estimates given in exercises (a) and (b).
11.
Assume a city with 12 blocks, as listed in the first column below. Measures of size
(approximate number of housing units in each block) are given in the second column. On the
basis of this information, we wish to select a sample of 4 blocks with probability proportionate
to size, and then to select housing units within the blocks in order to obtain a self-weighting
sample of an expected 10 housing units.
Block     Approximate Number    Cumulative    Actual Number    Serial Numbers
Number    of Housing Units      Measure       of Housing       of Actual
(PSU)     (measure of size)                   Units*           Housing Units
1         10                    10            9                1 to 9
2         5                     15            6                10 to 15
3         2                     17            2                16 to 17
4         5                     22            6                18 to 23
5         5                     27            6                24 to 29
6         10                    37            8                30 to 37
7         10                    47            8                38 to 45
8         2                     49            2                46 to 47
9         2                     51            4                48 to 51
10        5                     56            6                52 to 57
11        5                     61            6                58 to 63
12        10                    71            9                64 to 72
TOTALS    71                                  72
* The number of housing units that would actually be found in the block in a field operation if the block were
selected in the sample.
a.
Prepare a worksheet showing the selection of the sample of blocks. Assume 3.7 is the
random start number for designating the sample blocks.
b.
Assume that you have visited the blocks selected in your sample and determine the actual
number of housing units as given in the fourth column above. The housing units that
actually exist in each block are designated by "Serial Numbers" as shown in the fifth
column. Perform necessary computations for selecting the sample of housing units and
list the Serial Numbers for the housing units selected in your sample.
12.
Consider the list of 600 households of 30 villages located in 3 zones (see Appendix IV).
Using a two-stage cluster sample design, it is desired to estimate the total number of
persons in the population. A random sample of four clusters is chosen and five
households in each sampled cluster are randomly selected. Assume
households and consider the village as the cluster (PSU) for the survey.
a.
Estimate the total number of persons in the population.
b.
Compute the standard error for the estimate.
c.
Determine the coefficient of variation for the estimate.
d.
Construct a 95 percent confidence interval for the total and interpret the result.
CHAPTER 12
NONRESPONSE
The best way to deal with nonresponse is to prevent it. After nonresponse has occurred, it is
sometimes possible to model the missing data, but predicting the missing observations is never as
good as observing them in the first place. Nonrespondents often differ in critical ways from
respondents; if the nonresponse rate is not negligible, inference based upon only the respondents may
be seriously flawed.
We discuss two types of nonresponse in this chapter: unit nonresponse, in which the entire
observation unit is missing, and item nonresponse, in which some measurements are present for the
observation unit but at least one item is missing. In a survey of persons, unit nonresponse means that
the person provides no information for the survey; item nonresponse means that the person does not
respond to a particular item on the questionnaire. In the Current Population Survey and the National
Crime Victimization Survey (NCVS), unit nonresponse can arise for a variety of reasons: the
interviewer may not be able to contact the household; the person may be ill and cannot respond to
the survey; the person may refuse to participate in the survey. In these surveys, the interviewer tries
to get demographic information about the nonrespondent, such as age, sex, and race, as well as
characteristics of the dwelling unit, such as urban/rural status; this information can be used later to
adjust for the nonresponse. Item nonresponse occurs largely because of refusals: a household may
decline to give information about income, for example.
In agriculture or wildlife surveys, the term missing data is generally used instead of nonresponse, but
the concepts and remedies are similar. In a survey of breeding ducks, for example, some birds will
not be found by the researchers; they are, in a sense, nonrespondents. The nest may be raided by
predators before the investigator can determine how many eggs were laid; this is comparable to item
nonresponse.
In this chapter, we discuss four approaches to dealing with nonresponse:
1.
Prevent it. Design the survey so that nonresponse is low. This is by far the best method.
2.
Take a representative subsample of the nonrespondents; use that subsample to make
inferences about the other nonrespondents.
3.
Use a model to predict values for the nonrespondents. Weights implicitly use a model to
adjust for unit nonresponse. Imputation often adjusts for item nonresponse, and parametric
models may be used for either type of nonresponse.
4.
Ignore the nonresponse (not recommended, but unfortunately common in practice).
12.1
Effects of Ignoring Nonresponse
Example 12.1
Thomsen and Siring (1983) report results from a 1969 survey on voting behavior carried out by the
Central Bureau of Statistics in Norway. In this survey, three calls were followed by a mail survey.
The final nonresponse rate was 9.9%, which is often considered to be a small nonresponse rate. Did
the nonrespondents differ from the respondents?
In the Norwegian voting register, it was possible to find out whether a person voted in the election.
The percentage of persons who voted could then be compared for respondents and nonrespondents;
Table 12.1 shows the results. The selected sample is all persons selected to be in the sample,
including data from the Norwegian voting register for both respondents and nonrespondents.
The difference in voting rate between the nonrespondents and the selected sample was largest in the
younger age groups. Among the nonrespondents, the voting rate varied with the type of nonresponse.
The overall voting rate for the persons who refused to participate in the survey was 81%, the voting
rate for the not-at-homes was 65%, and the voting rate for the mentally and physically ill was 55%,
implying that absence or illness were the primary causes of nonresponse bias.
Table 12.1
Percentage of Persons Who Voted

Age                All    20-24    25-29    30-49    50-69    70-79
Nonrespondents     71     59       56       72       78       74
Selected Sample    88     81       84       90       91       84
It has been demonstrated repeatedly that nonresponse can have large effects on the results of a
survey; in Example 12.1, a nonresponse rate of less than 10% led to an overestimate of the voting rate in
Norway. Holt and Elliot discuss the results of a series of studies done on nonresponse in the United
Kingdom, indicating that “lower response rates are associated with the following characteristics:
London residents; households with no car; single people; childless couples; older people;
divorced/widowed people; new Commonwealth origin; lower educational attainment; self-employed” (1991, 334).
Moreover, increasing the sample size without targeting nonresponse does nothing to reduce
nonresponse bias; a larger sample size merely provides more observations from the class of persons
that would respond to the survey. Increasing the sample size may actually worsen the nonresponse
bias, as the larger sample size may divert resources that could have been used to reduce or remedy
the nonresponse or it may result in less care in the data collection. Recall that the infamous Literary
Digest Survey of 1936 (see Annex 1) had 2.4 million respondents but a response rate of less than
25%. The U. S. decennial census itself does not include the entire population, and the undercoverage
rate varies for different demographic groups. In the early 1990s, the nonresponse and undercoverage
in the U. S. Census prompted a lawsuit from certain cities to force the Census Bureau to adjust for
the nonresponse, and the debate about census adjustment continues.
Most small surveys ignore any nonresponse that remains after callbacks and follow-ups, and report
results based on complete records only. Hite (1987) did so in her survey and much of the criticism of
her results was based on her low response rate. Nonresponse is also ignored for many surveys
reported in newspapers, both local and national.
An analysis of complete records has the underlying assumption that the nonrespondents are similar to
the respondents and that units with missing items are similar to units that have responses for every
question. Much evidence indicates that this assumption does not hold true in practice. If
nonresponse is ignored in the NCVS, for example, victimization rates are underestimated. Biderman
and Cantor (1984) find lower victimization rates for persons who respond in three consecutive
interviews than for persons who are nonrespondents in at least one of those interviews or who
move before the panel study is completed.
Results reported from an analysis of only complete records should be taken as representative of the
population of persons who would respond to the survey, which is rarely the same as the target
population. If you insist on estimating population means and totals using only the complete records
and making no adjustment for nonrespondents, at the very least you should report the rate of
nonresponse.
The main problem caused by nonresponse is potential bias of population estimates. Think of the
population as being divided into two somewhat artificial strata of respondents and nonrespondents.
The population respondents are the units that would respond if they were chosen to be in the sample;
the number of population respondents, $N_R$, is unknown. Similarly, the $N_M$ (M for missing)
population nonrespondents are the units that would not respond. We then have the following
population quantities:

Stratum              Size     Total    Mean              Variance
Respondents          $N_R$    $T_R$    $\bar{y}_{RU}$    $S_R^2$
Nonrespondents       $N_M$    $T_M$    $\bar{y}_{MU}$    $S_M^2$
Entire Population    $N$      $T$      $\bar{y}_U$       $S^2$

The population as a whole has variance
\[ S^2 = \frac{1}{N-1} \sum_{i=1}^{N} (y_i - \bar{y}_U)^2, \]
with mean $\bar{y}_U$ and total $T$. A probability sample from the population will likely contain some
respondents and some nonrespondents. But, of course, on the first call we do not observe $y_i$ for any
of the units in the nonrespondent stratum. If the population mean in the nonrespondent stratum
differs from that in the respondent stratum, estimating the population mean using only the
respondents will produce bias.[1]
Let $\bar{y}_R$ be an approximately unbiased estimator of the mean in the respondent stratum, using only
the respondents. Because
\[ \bar{y}_U = \frac{N_R}{N}\,\bar{y}_{RU} + \frac{N_M}{N}\,\bar{y}_{MU}, \]
the bias is approximately
\[ E[\bar{y}_R] - \bar{y}_U \approx \frac{N_M}{N}\left(\bar{y}_{RU} - \bar{y}_{MU}\right). \]
The bias is small if either (1) the mean for the nonrespondents is close to the mean for the
respondents or (2) $N_M/N$ is small, that is, there is little nonresponse. But we can never be assured of (1), as
we generally have no data for the nonrespondents. Minimizing the nonresponse rate is the only sure
way to control nonresponse bias.
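As a rough numerical check of the bias approximation, the sketch below plugs in the figures from Example 12.1. It treats the selected-sample percentages as stand-ins for the population quantities, which is an assumption made only for illustration. The function names are hypothetical; the 9.9% nonresponse rate and the 71% and 88% voting rates come from the example.

```python
def respondent_mean(overall_mean, nonresp_mean, nonresp_rate):
    """Back out the respondent-stratum mean from the overall mean.

    Uses overall = (1 - rate) * respondent + rate * nonrespondent.
    """
    return (overall_mean - nonresp_rate * nonresp_mean) / (1.0 - nonresp_rate)


def nonresponse_bias(resp_mean, nonresp_mean, nonresp_rate):
    """Approximate bias of a respondents-only mean: rate * (resp - nonresp)."""
    return nonresp_rate * (resp_mean - nonresp_mean)


# Figures from Example 12.1 (percent who voted), used as illustrative inputs.
rate = 0.099      # final nonresponse rate
overall = 88.0    # voting rate in the selected sample
nonresp = 71.0    # voting rate among nonrespondents

resp = respondent_mean(overall, nonresp, rate)
bias = nonresponse_bias(resp, nonresp, rate)
print(f"respondent voting rate: {resp:.1f}%")
print(f"approximate bias:       {bias:.1f} percentage points")
```

Under these assumptions the respondents-only voting rate is about 89.9%, so ignoring a nonresponse rate of under 10% overstates the voting rate by roughly 1.9 percentage points.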
12.2
Designing Surveys to Reduce Nonsampling Errors
A common feature of poor surveys is a lack of time spent on the design and nonresponse follow-up
in the survey. Many persons new to surveys (and some, unfortunately, not new) simply jump in and
start collecting data without considering potential problems in the data-collection process; they mail
questionnaires to everyone in the target population and analyze those that are returned. It is not
surprising that such surveys have poor response rates. Many surveys reported in academic journals
on purchasing, for example, have response rates between 10 and 15%. It is difficult to see how
anything can be concluded about the population in such a survey.
A researcher who knows the target population well will be able to anticipate some of the reasons for
nonresponse and prevent some of it. Most investigators, however, do not know as much about
reasons for nonresponse as they think they do. They need to discover why the nonresponse occurs
and resolve as many of the problems as possible before commencing the survey.
These reasons can be discovered through designed experiments and application of quality-improvement methods to the data collection and processing. You do not know why previous surveys
related to yours have a low response rate? Design an experiment to find out. You think errors are
introduced in the data recording and processing? Use a nested design to find the sources of errors.
Any book on quality control or designed experiments will tell you how to collect your data.
And, of course, you can rely on previous researchers’ experiments to help you minimize
nonsampling errors. The references on experiment design and quality control at the end of the book
are a good place to start; Hidiroglou et al. (1993) give a general framework for nonresponse.

[1] The variance is often too low as well. In income surveys, for example, the rich and the poor are more
likely to be nonrespondents on the income question. In that case, $S_R^2$, for the respondent stratum, is smaller than
$S^2$. The point estimate of the mean may be biased, and the variance estimate may be biased, too.
Example 12.2
The 1990 U. S. decennial census attempted to survey each of the over 100 million households in the
United States. The response rate for the mail survey was 65%; households that did not mail in the
survey needed to be contacted in person, adding millions of dollars to the cost of the census.
Increasing the mail response rate for future censuses would result in tremendous savings.
Dillman et al. (1995a) report results of a factorial experiment employed in the 1992 Census
Implementation Test, designed to explore the individual effects and interactions of three
experimental factors on response rates. The three factors were:
(1)
a prenotice letter alerting the household to the impending arrival of the census form,
(2)
a stamped return envelope included with the census form, and
(3)
a reminder postcard sent a few days after the census form.
The results were dramatic, as shown in Figure 12.1. The experiment established that, although all
three factors influenced the response rate, the letter and postcard led to greater gains in response rate
than the stamped return envelope.
Figure 12.1
Response rates achieved for each combination of the factors letter, envelope, and postcard. The observed
response rate was 64.3% when all three aids were used and only 50% when none were used.
Nonresponse can have many different causes; as a result, no single method can be recommended for
every survey. Platek (1977) classifies sources of nonresponse as related to (1) survey content, (2)
methods of data collection, and (3) respondent characteristics, and illustrates various sources using
the diagram in Figure 12.2. Groves (1989) and Dillman (1978) discuss additional sources of
nonresponse.
Figure 12.2
Factors Affecting Nonresponse
The following are some factors that may influence response rate and data accuracy.
• Survey content. A survey on drug use or financial matters may have a large number of
refusals. Sometimes the response rate can be increased for sensitive items by careful
ordering of the questions or by using a randomized response technique (see Section 12.5).

• Time of survey. Some calling periods or seasons of the year may yield higher response rates
than others. The vacation month of August, for example, would be a bad time to take a
one-time household survey in Germany.

• Interviewers. Gower (1979) found a large variability in response rates achieved by different
interviewers, with about 15% of interviewers reporting almost no nonresponse. Some field
investigators in a bird survey may be better at spotting and identifying birds than others.
Standard quality-improvement methods can be applied to increase the response rate and
accuracy for interviewers. The same methods can be applied to the data-coding process.

• Data-collection method. Generally, telephone and mail surveys have a lower response rate
than in-person surveys (they also have lower costs, however). Computer Assisted Telephone
Interviewing (CATI) has been demonstrated to improve accuracy of data collected in
telephone surveys; with CATI, all questions are displayed on a computer, and the interviewer
codes the responses in the computer as questions are asked. CATI is especially helpful in
surveys in which a respondent’s answer to one question determines which question is asked
next (Catlin and Ingram 1988).
Mail, fax, and Internet surveys often have low response rates. Possible reasons for
nonresponse in a mail survey should be explored before the questionnaire is mailed: Is the
survey sent to the wrong address? Do recipients discard the envelope as junk mail even
before opening it? Will the survey reach the intended recipient? Will the recipient believe
that filling out the survey is worth the time?

• Questionnaire design. We have already seen that question wording has a large effect on the
responses received; it can also affect whether a person responds to an item on the
questionnaire. The volume edited by Tanur (1993) explores some recent research applying
cognitive research to question design. In a mail survey, a well-designed form
for the respondent may increase data accuracy.

• Respondent burden. Persons who respond to a survey are doing you an immense favor, and
the survey should be as nonintrusive as possible. A shorter questionnaire, requiring less
detail, may reduce the burden to the respondent. Respondent burden is a special concern in
panel surveys such as the NCVS, in which sampled households are interviewed every six
months for 3 ½ years. DeVries et al. (1996) discuss methods used to reduce respondent
burden; burden is also reduced when a smaller sample suffices to give the required precision.

• Survey introduction. The survey introduction provides the first contact between the
interviewer and potential respondent; a good introduction, giving the recipient motivation to
respond, can increase response rates dramatically. Nielsen Media Research emphasizes to
households in its selected sample that their participation in the Nielsen ratings affects which
television shows are aired. The respondent should be told for what purpose the data will be
used (unscrupulous persons often pretend to be taking a survey when they are really trying to
attract customers or converts) and assured of confidentiality.

• Incentives and disincentives. Incentives, financial or otherwise, may increase the response
rate. Disincentives may work as well: Physicians who refused to be assessed by peers after
selection in a stratified sample from the College of Physicians and Surgeons of Ontario
registry had their medical licenses suspended. Not surprisingly, nonresponse was low
(McAuley et al. 1990).

• Follow-up. The initial contact of the sample is usually less costly per unit than follow-ups of
the initial nonrespondents. If the initial survey is by mail, a reminder may increase the
response rate. Not everyone responds to follow-up calls, though; some persons will refuse to
respond to the survey no matter how often they are contacted. You need to decide how many
follow-up calls to make before the marginal returns do not justify the money spent.
You should try to obtain at least some information about nonrespondents that can be used later to
adjust for the nonresponse, and include surrogate items that can be used for item nonresponse. True,
there is no complete compensation for not having the data, but partial information may be better than
none. Information about the race, sex, or age of a nonrespondent may be used later to adjust for
nonresponse. Questions about income may well lead to refusals, but questions about cars,
employment, or education may be answered and can be used to predict income. If the pretests of the
survey indicate a nonresponse problem that you do not know how to prevent, try to design the survey
so that at least some information is collected for each observation unit.
The quality of survey data is largely determined at the design stage. Fisher’s (1938) words about
experiments apply equally well to the design of sample surveys: “To call in the statistician after the
experiment is done may be no more than asking him to perform a postmortem examination: he may
be able to say what the experiment died of.” Any survey budget needs to allocate sufficient
resources for survey design and for nonresponse follow-up. Do not scrimp on the survey design;
every hour spent on design may save weeks of remorse later.
12.3
Callbacks and Two-Phase Sampling
Virtually all good surveys rely on callbacks to obtain responses from persons not at home for the first
try. Analysis of callback data can provide some information about the biases that can be expected
from the remaining nonrespondents.
Example 12.3
Traugott (1987) analyzed callback data from two 1984 Michigan polls on preference for presidential
candidates. The overall response rates for the surveys were about 65%, typical for large political
polls. About 21% of the interviewed sample responded on the first call; up to 30 attempts were
made to reach persons who did not respond on the first call. Traugott found that later respondents
were more likely to be male, older, and Republican than early respondents; while 48% of the
respondents who answered the first call supported Reagan and 45% supported Mondale, 59% of the
entire sample supported Reagan as opposed to 39% for Mondale. Differing procedures for
nonresponse follow-up and persistence in callback may explain some of the inconsistencies among
political polls.
If nonrespondents resemble late respondents, one might speculate that nonrespondents were more
likely to favor Reagan. But nonrespondents do not necessarily resemble the hard-to-reach; persons
who absolutely refuse to participate may differ greatly from persons who could not be contacted
immediately, and nonrespondents may be more likely to have illnesses or other circumstances
preventing participation. We also do not know how likely it is that nonrespondents to the surveys
will vote in the election; even if we speculate that they were more likely to favor Reagan, they are
not necessarily more likely to vote for Reagan.
Often, when the survey is designed so that callbacks will be used, the initial contact is by mail
survey; the follow-up calls use a more expensive method such as a personal interview.
Hansen and Hurwitz (1946) proposed subsampling the nonrespondents and using two-phase
sampling (also called double sampling) for stratification to estimate the population mean or total.
The population is divided into two strata, as described in Section 12.1; the two strata are respondents
and initial nonrespondents, persons who do not respond in the first call. We will develop the theory
of two-phase sampling for general survey designs in Section 12.1; here, we illustrate how it can be
used for nonresponse.
In the simplest form of two-phase sampling, randomly select $n$ units in the population. Of these, $n_R$
respond and $n_M$ do not respond. The values $n_R$ and $n_M$, though, are random variables; they will
change if a different simple random sample (SRS) is selected. Then, make a second call on a
random subsample of $100v\%$ of the $n_M$ nonrespondents in the sample, where the subsampling
fraction $v$ does not depend on the data collected.
Suppose that through some superhuman effort all the targeted nonrespondents are reached. Let
$\bar{y}_R$ be the sample average of the original respondents and $\bar{y}_M$ (M stands for missing) be the average
of the subsampled nonrespondents. The two-phase sampling estimates of the population mean and
total are:
\[ \hat{\bar{y}} = \frac{n_R}{n}\,\bar{y}_R + \frac{n_M}{n}\,\bar{y}_M \tag{12.1} \]
and
\[ \hat{T} = N\hat{\bar{y}} = \frac{N}{n}\sum_{i \in S_R} y_i + \frac{N}{n}\,\frac{1}{v}\sum_{i \in S_M} y_i, \tag{12.2} \]
where $S_R$ represents the sampled units in the respondent stratum and $S_M$ represents the sampled units
in the nonrespondent stratum. Note that $\hat{T}$ is a weighted sum of the observed units; the weights are
$N/n$ for the respondents and $N/(nv)$ for the subsampled nonrespondents. Because only a subsample
was taken in the nonrespondent stratum, each subsampled unit in that stratum represents more units
in the population than does a unit in the respondent stratum.
The expected value and variance of these estimators are found in Section 12.1. Because $\hat{T}$ is an
appropriately weighted unequal-probability estimator, Theorem 6.2 implies that $E[\hat{T}] = T$. From
(12.5), if the finite population corrections can be ignored, we can estimate the variance by
\[ \hat{V}(\hat{\bar{y}}) = \frac{n_R - 1}{n - 1}\,\frac{s_R^2}{n} + \frac{n_M - 1}{n - 1}\,\frac{s_M^2}{vn} + \frac{1}{n - 1}\left[\frac{n_R}{n}\left(\bar{y}_R - \hat{\bar{y}}\right)^2 + \frac{n_M}{n}\left(\bar{y}_M - \hat{\bar{y}}\right)^2\right], \]
where $s_R^2$ and $s_M^2$ are the sample variances of the respondents and of the subsampled nonrespondents.
If everyone responds in the subsample, two-phase sampling not only removes the nonresponse bias
but also accounts for the original nonresponse in the estimated variance.
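The two-phase calculation can be sketched in a few lines of Python. This is a minimal illustration, not a production routine: the function name and the data are hypothetical, every subsampled nonrespondent is assumed to respond, and the variance estimate uses the usual two-phase-for-stratification form with finite population corrections ignored, which is assumed here to match the manual's equation (12.5).

```python
from statistics import mean, variance


def two_phase_mean(y_resp, y_sub, n):
    """Two-phase estimate of the population mean and its estimated variance.

    y_resp : values from the n_R original respondents (at least two)
    y_sub  : values from the subsampled nonrespondents (at least two,
             all assumed reached)
    n      : size of the original sample, so n_M = n - n_R
    """
    n_r = len(y_resp)
    n_m = n - n_r
    v = len(y_sub) / n_m                      # realized subsampling fraction
    ybar_r, ybar_m = mean(y_resp), mean(y_sub)

    # Stratum means weighted by the sample shares of each stratum
    est = (n_r / n) * ybar_r + (n_m / n) * ybar_m

    # Variance estimate, finite population corrections ignored
    s2_r, s2_m = variance(y_resp), variance(y_sub)
    within = ((n_r - 1) / (n - 1)) * s2_r / n \
        + ((n_m - 1) / (n - 1)) * s2_m / (v * n)
    between = ((n_r / n) * (ybar_r - est) ** 2
               + (n_m / n) * (ybar_m - est) ** 2) / (n - 1)
    return est, within + between


# Hypothetical data: 8 units sampled, 4 respond, half of the 4 nonrespondents
# are followed up and both answer.
est, var_est = two_phase_mean([1, 2, 3, 4], [6, 8], n=8)
N = 80                                        # hypothetical population size
print(f"mean estimate:  {est:.2f}")
print(f"total estimate: {N * est:.0f}")
print(f"standard error of the mean: {var_est ** 0.5:.3f}")
```

In this toy example the respondents-only mean would be 2.5, while the two-phase estimate is 4.75 because the followed-up nonrespondents have much larger values.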
12.4
Mechanisms for Nonresponse
Most surveys have some residual nonresponse even after careful design and follow-up of
nonresponse. All methods for fixing up nonresponse are necessarily model-based. If we are to
make any inferences about the nonrespondents, we must assume that they are related to respondents
in some way. A good nontechnical reference for methods of dealing with nonresponse is Groves
(1989); the three-volume set edited by Madow et al. (1983) contains much information on the
statistical research on nonresponse up to that date.
Dividing population members into two fixed strata of would-be respondents and would-be
nonrespondents is fine for thinking about potential nonresponse bias and for two-phase methods. To
adjust for nonresponse that remains after all other measures have been taken, we need a more
elaborate setup, letting the response or nonresponse of unit $i$ be a random variable. Define the
random variable
\[ R_i = \begin{cases} 1 & \text{if unit } i \text{ responds,} \\ 0 & \text{if unit } i \text{ does not respond.} \end{cases} \]
After sampling, the realizations of the response indicator variable are known for the units selected in
the sample. A value for $y_i$ is recorded if $r_i$, the realization of $R_i$, is 1. The probability that a unit
selected for the sample will respond,
\[ \phi_i = P(R_i = 1), \]
is of course unknown but assumed positive. Rosenbaum and Rubin (1983) call $\phi_i$ the propensity
score for the $i$th unit.
Suppose that $y_i$ is a response of interest and that $x_i$ is a vector of information known about unit $i$ in
the sample. Information used in the survey design is included in $x_i$. We consider three types of
missing data, using the Little and Rubin (1987) terminology of nonresponse classification.
Missing Completely at Random. If the response probability $\phi_i$ does not depend on $x_i$, $y_i$, or the
survey design, the missing data are missing completely at random (MCAR). Such a situation occurs if,
for example, someone at the laboratory drops a test tube containing the blood sample of one of the
survey participants; there is no reason to think that the dropping of the test tube had anything to do
with the white blood cell count.[2] If data are MCAR, the respondents are representative of the selected sample.
Missing data in the NCVS would be MCAR if the probability of nonresponse is completely unrelated
to region of the United States, race, sex, age, or any other variable measured for the sample and if the
probability of nonresponse is unrelated to any variables about victimization status. Nonrespondents
would be essentially selected at random from the sample.
If the response probabilities $\phi_i$ are all equal and the events $\{R_i = 1\}$ are conditionally independent of
each other and of the sample-selection process given $n_R$, then the data are MCAR. If an SRS of size
$n$ is taken, then under this mechanism the respondents will be a simple random subsample of variable
size $n_R$. The sample mean of the respondents, $\bar{y}_R$, is approximately unbiased for the population
mean. The MCAR mechanism is implicitly adopted when nonresponse is ignored.

[2] Even here, though, the suspicious mind can create a scenario in which the nonresponse might be related to
quantities of interest: perhaps workers are less likely to drop test tubes that they believe contain HIV.
Missing at Random Given Covariates, or Ignorable Nonresponse. If the response probability $\phi_i$
depends on $x_i$ but not on $y_i$, the
data are missing at random (MAR); the nonresponse depends only on observed variables. We can
successfully model the nonresponse, since we know the values of $x_i$ for all sample units. Persons in
the NCVS would be missing at random if the probability of responding to the survey depends on
race, sex, and age (all known quantities) but does not vary with victimization experience within each
age/race/sex class. This is sometimes termed ignorable nonresponse: ignorable means that a model
can explain the nonresponse mechanism and that the nonresponse can be ignored after the model
accounts for it, not that the nonresponse can be completely ignored and complete-data methods used.
Nonignorable Nonresponse. If the probability of nonresponse depends on the value of a response
variable and cannot be completely explained by values of the x’s, then the nonresponse is
nonignorable. This is likely the situation for the NCVS: it is suspected that a person who has been
victimized by crime is less likely to respond to the survey than a nonvictim, even if they share the
values of all known variables such as race, age, and sex. Crime victims may be more likely to move
after a victimization and thus not be included in subsequent NCVS interviews. Models can help in
this situation, because the nonresponse probability may also depend on known variables, but cannot
completely adjust for the nonresponse.
The probabilities of responding, $\phi_i$, are useful for thinking about the type of nonresponse.
Unfortunately, they are unknown, so we do not know for sure which type of nonresponse is present.
We can sometimes distinguish between MCAR and MAR by fitting a model attempting to predict
the observed probabilities of response for subgroups from known covariates; if the coefficients in a
logistic regression model are significantly different from zero, the missing data are likely not MCAR.
Distinguishing between MAR and nonignorable nonresponse is more difficult. In the next section,
we discuss a method for estimating the $\phi_i$’s.
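When the only covariate is a single classification such as age class, a simpler stand-in for the logistic regression is a Pearson chi-square test of whether response rates are equal across classes. The sketch below is an illustration under stated assumptions: the function name is hypothetical, the counts are the sample sizes and respondent counts by age class reported in Table 12.2, and the test is unweighted, so it ignores the sampling weights and any design effects.

```python
def response_rate_chi2(sample_sizes, respondents):
    """Pearson chi-square statistic for equal response rates across classes.

    Returns the statistic and its degrees of freedom (classes - 1).
    """
    total_n = sum(sample_sizes)
    p_resp = sum(respondents) / total_n        # pooled response rate
    chi2 = 0.0
    for n_c, r_c in zip(sample_sizes, respondents):
        exp_resp = n_c * p_resp                # expected respondents
        exp_non = n_c * (1.0 - p_resp)         # expected nonrespondents
        chi2 += (r_c - exp_resp) ** 2 / exp_resp
        chi2 += ((n_c - r_c) - exp_non) ** 2 / exp_non
    return chi2, len(sample_sizes) - 1


# Counts by age class (15-24, 25-34, 35-44, 45-64, 65+) from Table 12.2
sample_sizes = [202, 220, 180, 195, 203]
respondents = [124, 187, 162, 187, 203]

chi2, df = response_rate_chi2(sample_sizes, respondents)
CRITICAL_5PCT_DF4 = 9.488                      # chi-square critical value, 4 df, 5% level
print(f"chi-square = {chi2:.1f} on {df} df")
print("response rates differ by class" if chi2 > CRITICAL_5PCT_DF4
      else "no evidence against equal response rates")
```

A statistic far above the critical value, as here, indicates that response depends on age, so the MCAR assumption is implausible for these data. It says nothing about whether the data are MAR or nonignorable.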
12.5
Weighting Methods for Nonresponse
In previous chapters we have seen how weights can be used in calculating estimates for various
sampling schemes (see Sections 4.3, 5.4, and 7.2). The sampling weights are the reciprocals of the
probabilities of selection, so an estimate of the population total is
\[ \hat{T} = \sum_{i \in S} w_i y_i, \]
where $S$ denotes the set of sampled units.
For stratification, the weights are $w_i = N_h / n_h$ if unit $i$ is in stratum $h$; for sampling elements with
unequal probabilities, $w_i = 1/\pi_i$, where $\pi_i$ is the probability that unit $i$ is selected.
Weights can also be used to adjust for nonresponse. Let Zi be the indicator variable for presence in
the selected sample, with P(Zi = 1) = Bi. If Ri is independent of Zi, then the probability that unit i will
be measured is
P(unit i selected in sample and responds) = Bi Mi.
The probability of responding, Mi, is estimated for each unit in the sample, using auxiliary
information that is known for all units in the selected sample. The final weight for a respondent is
then
$$\frac{1}{B_i \hat{M}_i},$$
where $\hat{M}_i$ is the estimated probability of responding.
Weighting methods assume that the response probabilities can be estimated from
variables known for all units; they assume MAR data. References for more information on
weighting are Oh and Scheuren (1983) and Holt and Elliot (1991).
12.5.1. Weighting-Class Adjustment
Sampling weights wi have been interpreted as the number of units in the population represented by
unit i of the sample. Weighting-class methods extend this approach to compensate for nonsampling
errors: variables known for all units in the selected sample are used to form weighting-adjustment
classes, and it is hoped that respondents and nonrespondents in the same weighting-adjustment class
are similar. Weights of respondents in the weighting-adjustment class are increased so that the
respondents represent the nonrespondents’ share of the population as well as their own.
Example 12.4
Suppose the age is known for every member of the selected sample and that person i in the selected
sample has sampling weight wi = (1 / Bi). Then weighting classes can be formed by dividing the
selected sample among different age classes, as Table 12.2 shows.
We estimate the response probability for each class by
$$\hat{M}_c = \frac{\text{sum of weights for respondents in class } c}{\text{sum of weights for selected sample in class } c}.$$
Then the sampling weight for each respondent in class c is multiplied by $1/\hat{M}_c$, the weight factor in
Table 12.2. The weight of each respondent with age between 15 and 24, for example, is multiplied
by 1.622. Since there was no nonresponse in the over-65 group, their weights are unchanged.
Table 12.2
Illustration of Weighting-Class Adjustment Factors

                                  Age
                                  15-24    25-34    35-44    45-64    65+      Total
Sample size                       202      220      180      195      203      1000
Respondents                       124      187      162      187      203      863
Sum of weights for sample         30322    33013    27046    29272    30451    150104
Sum of weights for respondents    18693    28143    24371    28138    30451
Estimated response probability    0.6165   0.8525   0.9011   0.9613   1.0000
Weight factor                     1.622    1.173    1.110    1.040    1.000
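The weight factors in Table 12.2 can be reproduced from the two rows of weight sums. The following is a minimal sketch, not part of the original manual; the function and variable names are illustrative assumptions, and only the numbers come from the table.

```python
# Sums of sampling weights by age class, taken from Table 12.2.
age_classes = ["15-24", "25-34", "35-44", "45-64", "65+"]
sample_weight_sums = [30322, 33013, 27046, 29272, 30451]
respondent_weight_sums = [18693, 28143, 24371, 28138, 30451]


def weighting_class_factors(sample_sums, respondent_sums):
    """Return (estimated response probability, weight factor) for each class."""
    results = []
    for total, responding in zip(sample_sums, respondent_sums):
        response_prob = responding / total      # share of weight that responded
        results.append((response_prob, 1.0 / response_prob))
    return results


factors = weighting_class_factors(sample_weight_sums, respondent_weight_sums)
for label, (prob, factor) in zip(age_classes, factors):
    print(f"{label}: response probability {prob:.4f}, weight factor {factor:.3f}")
```

Each respondent's sampling weight is then multiplied by the factor for his or her class, so that the adjusted respondent weights in a class add up to the sum of weights for the whole selected sample in that class.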
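The probability of response is assumed to be the same within each weighting class, with the
implication that within a weighting class, the probability of response does not depend on y. As
mentioned earlier, weighting-class methods assume MAR data. The weight for a respondent in
weighting class c is
$$\frac{1}{B_i \hat{M}_c}.$$
To estimate the population total using weighting-class adjustments, let xci = 1 if unit i is in class c,
and 0 otherwise. Then let the new weight for respondent i be
$$\tilde{w}_i = w_i \sum_c \frac{x_{ci}}{\hat{M}_c},$$
where wi is the sampling weight for unit i; that is, $\tilde{w}_i = w_i / \hat{M}_c$ if unit i is in class c. Assign
$\tilde{w}_i = 0$ if unit i is a nonrespondent. Then,
$$\hat{t}_{wc} = \sum_{i \in S} \tilde{w}_i y_i$$
and
$$\bar{y}_{wc} = \frac{\hat{t}_{wc}}{\sum_{i \in S} \tilde{w}_i}.$$
In an SRS, for example, if nc is the number of sample units in class c, ncR is the number of
respondents in class c, and $\bar{y}_{cR}$ is the average for the respondents in class c, then
$$\hat{M}_c = \frac{n_{cR}}{n_c}$$
and
$$\hat{t}_{wc} = \sum_{i \in S_R} \frac{N}{n} \, y_i \sum_c x_{ci} \frac{n_c}{n_{cR}} = N \sum_c \frac{n_c}{n} \, \bar{y}_{cR},$$
where $S_R$ is the set of respondents.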
Example 12.5 The National Crime Victimization Survey
To adjust for individual nonresponse in the NCVS, the within-household noninterview adjustment
factor (WHHNAF) of Chapter 7 is used. NCVS interviewers gather demographic information on the
nonrespondents, and this information is used to classify all persons into 24 weighting-adjustment
cells. The cells depend on the age of the person, the relation of the person to the reference person
(head of household), and the race of the reference person.
For any cell, let WR be the sum of the weights for the respondents and WM be the sum of the weights
for the nonrespondents. Then the new weight for a respondent in a cell will be the previous weight
multiplied by the weighting-adjustment factor
$$\frac{W_R + W_M}{W_R}.$$
Thus, the weights that would be assigned to nonrespondents are reallocated among respondents with
similar (we hope) characteristics.
A problem occurs if $(W_R + W_M)/W_R$ is too large. If $(W_R + W_M)/W_R > 2$, the cell contains
more nonrespondents than respondents (as measured by their weights). In this case, the variance of the estimate increases; if the
number of respondents in the cell is small, the weight may not be stable. The U. S. Census Bureau
collapses cells to obtain weighting-adjustment factors of 2 or less. If there are fewer than 30
interviewed persons in a cell or if the weighting-adjustment factor is greater than 2, the cell is
combined (collapsed) with neighboring cells until the collapsed cell has at least 30 observations
and a weighting-adjustment factor of 2 or less.
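The collapsing rule just described can be written as a simple check. This is only a sketch under stated assumptions: the function name and arguments are hypothetical, the adjustment factor is taken to be (WR + WM) / WR, and the thresholds of 30 respondents and a factor of 2 are the ones quoted in the text.

```python
def needs_collapsing(n_respondents, w_resp, w_nonresp,
                     min_respondents=30, max_factor=2.0):
    """Return True if a weighting cell should be merged with a neighboring cell.

    n_respondents: number of interviewed persons in the cell
    w_resp:        sum of weights for respondents (WR)
    w_nonresp:     sum of weights for nonrespondents (WM)
    """
    factor = (w_resp + w_nonresp) / w_resp   # weighting-adjustment factor
    return n_respondents < min_respondents or factor > max_factor


# Hypothetical cells, for illustration only.
print(needs_collapsing(25, 100.0, 20.0))    # too few respondents
print(needs_collapsing(40, 100.0, 120.0))   # factor 2.2 exceeds 2
print(needs_collapsing(40, 100.0, 50.0))    # factor 1.5, enough respondents
```

In practice the check would be reapplied to each merged cell until every cell passes.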
Construction of Weighting Classes Weighting-adjustment classes should be constructed as
though they were strata; as shown in the next section, weighting adjustment is similar to post-stratification. The classes should be formed so that units within each class are as similar as possible
with respect to the major variables of interest and so that the response rates vary from class to class.
Little (1986) suggests estimating the response probabilities as a function of the known variables
(perhaps using logistic regression) and grouping observations into classes based on the estimated
response probabilities $\hat{M}_i$. This
approach is preferable to simply using the estimated values of Mi in individual case weights, as the
estimated response probabilities may be extremely variable and might cause the final estimates to be
unstable.
12.5.2 Post-stratification
Post-stratification is similar to weighting-class adjustment, except that population counts are used to
adjust the weights. Suppose an SRS is taken. After the sample is collected, units are grouped into H
different post-strata, usually based on demographic variables such as race or sex. The population has
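Nh units in post-stratum h; of these, nh were selected for the sample and nhR responded. Let $\bar{y}_{hR}$ be the
mean of the respondents in post-stratum h. The post-stratified estimator for the population mean $\bar{y}_U$ is
$$\bar{y}_{post} = \sum_{h=1}^{H} \frac{N_h}{N} \, \bar{y}_{hR};$$
the weighting-class estimator for $\bar{y}_U$, if the weighting classes are the post-strata, is
$$\bar{y}_{wc} = \sum_{h=1}^{H} \frac{n_h}{n} \, \bar{y}_{hR}.$$
The two estimators are similar in form; the only difference is that in post-stratification the Nh are
known, whereas in weighting-class adjustments the Nh are unknown and estimated by (N nh / n).
For the post-stratified estimator, often the conditional variance given the nhR is used. For an SRS,
$$V(\bar{y}_{post} \mid n_{hR}, h = 1, \dots, H) = \sum_{h=1}^{H} \left(\frac{N_h}{N}\right)^2 \left(1 - \frac{n_{hR}}{N_h}\right) \frac{S_h^2}{n_{hR}}. \qquad (12.3)$$
The unconditional variance of $\bar{y}_{post}$ is slightly larger, with additional terms of order $1/n_{hR}^2$, as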
given in Oh and Scheuren (1983). A variance estimator for post-stratification will be given in
Exercise 5 of Chapter 9.
12.5.2.1.
Post-stratification Using Weights
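In a general survey design, the sum of the weights in subgroup h is supposed to estimate the
population count Nh for that subgroup. Post-stratification uses the ratio estimator within each
subgroup to adjust by the true population count.
Let xhi = 1 if unit i is a respondent in post-stratum h, and 0 otherwise.
Then, let
$$w_i^* = w_i \sum_{h=1}^{H} x_{hi} \frac{N_h}{\sum_{j \in S} w_j x_{hj}}.$$
Using the modified weights,
$$\sum_{i \in S} w_i^* x_{hi} = N_h,$$
and the post-stratified estimator of the population total is
$$\hat{t}_{post} = \sum_{i \in S} w_i^* y_i.$$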
Post-stratification can adjust for undercoverage as well as nonresponse if the population count Nh
includes individuals not in the sampling frame for the survey.
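As an illustration of post-stratification using weights, the sketch below rescales the respondents' weights so that they sum to the known count in each post-stratum. The function name, the two post-strata, the weights, the responses, and the population counts are all hypothetical values chosen for the example.

```python
def poststratify(weights, strata, population_counts):
    """Scale respondent weights so they sum to the known count N_h in each post-stratum."""
    weight_sums = {}
    for w, h in zip(weights, strata):
        weight_sums[h] = weight_sums.get(h, 0.0) + w
    # Ratio adjustment: known count divided by the current sum of weights.
    return [w * population_counts[h] / weight_sums[h]
            for w, h in zip(weights, strata)]


# Hypothetical respondents: current weights, post-stratum labels, responses.
weights = [10.0, 10.0, 10.0, 10.0, 10.0]
strata = ["F", "F", "M", "M", "M"]
y = [1, 0, 1, 1, 0]
population_counts = {"F": 30, "M": 24}   # assumed known population counts

new_weights = poststratify(weights, strata, population_counts)
t_post = sum(w * yi for w, yi in zip(new_weights, y))
print(new_weights, t_post)
```

Note that the factor for the "M" post-stratum here is below 1, which cannot happen with a weighting-class adjustment for nonresponse alone.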
Example 12.6
The second stage factor in the NCVS (see Section 7.6) uses post-stratification to adjust the weights.
After all other weighting adjustments have been done, including the weighting-class adjustments for
nonresponse, post-stratification is used to make the sample counts agree with estimates of the
population counts from the U. S. Census Bureau. Each person is assigned to one of 72 post-strata
based on the person’s age, race, and sex. The number of persons in the population falling in that
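post-stratum, Nh, is known from other sources. Then, the weight for a person in post-stratum h is
multiplied by
$$\frac{N_h}{\sum_{j \in S} w_j x_{hj}},$$
where xhj = 1 if person j is a respondent in post-stratum h and 0 otherwise, so that the denominator is the
current sum of weights in that post-stratum.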
With weighting classes, the weighting factor to adjust for unit nonresponse is always at least 1. With
post-stratification, because weights are adjusted so that they sum to a known population total, the
weighting factor can be any positive number, although weighting factors of 2 or less are desirable.
Post-stratification assumes that:
(1)
within each post-stratum, each unit selected to be in the sample has the same probability of
being a respondent,
(2)
the response or nonresponse of a unit is independent of the behavior of all other units, and
(3)
nonrespondents in a post-stratum are like the respondents.
The data are MCAR within each post-stratum. These are big assumptions; to make them seem a
little more plausible, survey researchers often use many post-strata. But a large number of post-strata
may create additional problems, in that few respondents in some post-strata may result in unstable
estimates, and may preclude the application of the central limit theorem. If faced with post-strata
with few observations, most practitioners collapse the post-strata with others that have similar means
in key variables until they have a reasonable number of observations in each post-stratum. For the
Current Population Survey, a “reasonable” number means that each group has at least 20
observations and that the response rate for each group is at least 50%.
12.5.2.2.
Raking Adjustments
Raking is a post-stratification method that can be used when post-strata are formed using more than
one variable, but only the marginal population totals are known.
Raking was first used in the 1940 census to ensure that the complete census data and samples taken
from it gave consistent results and was introduced in Deming and Stephan (1940); Brackstone and
Rao (1976) further developed the theory. Oh and Scheuren (1983) describe raking ratio estimates for
nonresponse.
Consider the following table of sums of weights from a sample; each entry in the table is the sum of
the sampling weights for persons in the sample falling in that classification (for example, the sum of
the sampling weights for black females is 300).
                  Black     White     Asian     Native American    Other     Sum of Weights
Female            300       1200      60        30                 30        1620
Male              150       1080      90        30                 30        1380
Sum of Weights    450       2280      150       60                 60        3000
Now suppose we know the true population counts for the marginal totals: we know that the
population has 1510 women and 1490 men, 600 blacks, 2120 whites, 150 Asians, 100 Native
Americans, and 30 persons in the “Other” category. The population counts for each cell in the table,
however, are unknown; we do not know the number of black females in this population and cannot
assume independence. Raking allows us to adjust the weights so that the sums of weights in the
margins equal the population counts.
First, adjust the rows. Multiply each entry by (true row population) / (estimated row population).
Multiplying the cells in the female row by 1510/1620 and the cells in the male row by 1490/1380
results in the following table:
                  Black     White     Asian     Native American    Other     Sum of Weights
Female            279.63    1118.52   55.93     27.96              27.96     1510
Male              161.96    1166.09   97.17     32.39              32.39     1490
Sum of Weights    441.59    2284.61   153.10    60.35              60.35     3000
The row totals are fine now, but the column totals do not yet equal the population totals. Repeat the
same procedure with the columns in the new table. The entries in the first column are each
multiplied by 600/441.59. The following table results:
                  Black     White     Asian     Native American    Other     Sum of Weights
Female            379.94    1037.93   54.79     46.33              13.90     1532.90
Male              220.06    1082.07   95.21     53.67              16.10     1467.10
Sum of Weights    600       2120      150       100                30        3000
But this has thrown the row totals off again. Repeat the procedure until both row and column totals
equal the population counts. The procedure converges as long as all cell counts are positive. In this
example, the final table of adjusted counts is
                  Black     White     Asian     Native American    Other     Sum of Weights
Female            375.59    1021.47   53.72     45.56              13.67     1510
Male              224.41    1098.53   96.28     54.44              16.33     1490
Sum of Weights    600       2120      150       100                30        3000
The entries in the last table may be better estimates of the cell populations (that is, with smaller
variance) than the original weighted estimates, simply because they use more information about the
population. The weighting-adjustment factor for each white male in the sample is 1098.53/1080; the
weight of each white male is increased a little to adjust for nonresponse and undercoverage.
Likewise, the weights of white females are decreased because they are overrepresented in the sample.
The assumptions for raking are the same as for post-stratification, with the additional assumption
that the response probabilities depend only on the row and column and not on the particular cell. If
the sample sizes in each cell are large enough, the raking estimator is approximately unbiased.
Raking has some difficulties–the algorithm may not converge if some of the cell estimates are zero.
There is also a danger of “overadjustment”–if there is little relation between the extra dimension in
raking and the cell means, raking can increase the variance rather than decrease it.
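The raking steps in the example above (adjust the rows, then the columns, and repeat) can be sketched as follows. This is a minimal illustration rather than production code: the function name, tolerance, and iteration limit are assumptions, and the input table and marginal totals are those of the race-by-sex example.

```python
def rake(table, row_targets, col_targets, tol=1e-8, max_iter=1000):
    """Alternately scale rows and columns until both margins match their targets."""
    cells = [[float(v) for v in row] for row in table]
    for _ in range(max_iter):
        # Row step: multiply each row by (true row total) / (current row total).
        for i, target in enumerate(row_targets):
            row_sum = sum(cells[i])
            cells[i] = [v * target / row_sum for v in cells[i]]
        # Column step: same adjustment applied to each column.
        for j, target in enumerate(col_targets):
            col_sum = sum(row[j] for row in cells)
            for row in cells:
                row[j] *= target / col_sum
        # Columns now match exactly; stop once the rows also match.
        worst = max(abs(sum(row) - t) for row, t in zip(cells, row_targets))
        if worst < tol:
            break
    return cells


# Sums of sampling weights: rows are Female, Male;
# columns are Black, White, Asian, Native American, Other.
sample = [[300, 1200, 60, 30, 30],
          [150, 1080, 90, 30, 30]]
raked = rake(sample, [1510, 1490], [600, 2120, 150, 100, 30])
for row in raked:
    print([round(v, 2) for v in row])
```

The weighting-adjustment factor for a cell is the raked value divided by the original sum of weights, for example `raked[1][1] / 1080` for white males. A table containing zero cells would need special handling, in line with the convergence caveat noted above.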
12.5.3 Estimating the Probability of Response: Other Methods
Some weighting-class methods use weights that are the reciprocal of the estimated probability of
response. A famous example is the Politz-Simmons method for adjusting for nonavailability of
sample members.
Suppose all calls are made during Monday through Friday evenings. Each respondent is asked
whether he or she was at home, at the time of the interview, on each of the four preceding
weeknights. The respondent replies that she was home k of the four nights. It is then assumed that
the probability of response is proportional to the number of nights at home during interviewing
hours, so the probability of response is estimated by
$$\hat{M}_i = \frac{k_i + 1}{5}.$$
The sampling weight wi for
each respondent is then multiplied by 5/(ki + 1). The respondents with k = 0 were home on only one
of the five nights and are assigned to represent their share of the population plus the share of four
persons in the sample who were called on one of their “unavailable” nights. The respondents most
likely to be home have k = 4; it is presumed that all persons in the sample who were home every
night were reached, so their weights are unchanged. The estimate of the population mean is
$$\bar{y}_{PS} = \frac{\sum_{i \in S_R} 5 w_i y_i / (k_i + 1)}{\sum_{i \in S_R} 5 w_i / (k_i + 1)},$$
where $S_R$ is the set of respondents.
This method of weighting–described by Hartley (1946) and Politz and Simmons (1949)–is based on
the premise that the most accessible persons will tend to be overrepresented in the survey data. The
method is easy to use, theoretically appealing, and can be used in conjunction with callbacks. But it
still misses people who were not at home on any of the five nights or who refused to participate in
the survey. Because nonresponse is due largely to refusals in some telephone surveys, the Politz-Simmons method may not be helpful in dealing with all nonresponse. Values of k may also be in
error, because people may err when recalling how many evenings they were home.
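The Politz-Simmons adjustment can be sketched in a few lines. The function name and the three example respondents are hypothetical; the only elements taken from the text are the weight multiplier 5/(k + 1) and the use of the adjusted weights in a weighted mean.

```python
def politz_simmons_mean(y, k, weights=None, nights=5):
    """Weighted mean in which each respondent's weight is multiplied by nights / (k + 1).

    y:       responses of the respondents
    k:       number of the preceding (nights - 1) evenings each respondent was home
    weights: sampling weights (equal weights are assumed if omitted)
    """
    if weights is None:
        weights = [1.0] * len(y)
    adjusted = [w * nights / (ki + 1) for w, ki in zip(weights, k)]
    return sum(a * yi for a, yi in zip(adjusted, y)) / sum(adjusted)


# Hypothetical respondents: home 0, 4, and 1 of the four preceding weeknights.
y = [1, 0, 1]
k = [0, 4, 1]
ps_mean = politz_simmons_mean(y, k)
print(ps_mean)
```

Here the respondent who is rarely home (k = 0) receives five times the weight of the respondent who is always home (k = 4).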
Potthoff et al. (1993) modified and extended the Politz-Simmons method to determine weights based
on the number of callbacks needed, assuming that the Mi’s follow a beta distribution.
12.5.4. A Caution About Weights
The models for weighting adjustments for nonresponse are strong: in each weighting cell, the
respondents and nonrespondents are assumed to be similar. Each individual in a weighting class is
assumed equally likely to respond to the survey, regardless of the value of the response. These
models never exactly describe the true state of affairs, and you should always consider their
plausibility and implications. It is an unfortunate tendency of many survey practitioners to treat the
weighting adjustment as a complete remedy and to then act as though there was no nonresponse.
Weights may improve many of the estimates, but they rarely eliminate all nonresponse bias. If
weighting adjustments are made (and remember, making no adjustments is itself a model about the
nature of the nonresponse), practitioners should always state the assumed response model and give
evidence to justify it. Weighting adjustments are usually used for unit nonresponse, not for item
nonresponse (which would require a different weight for each item).
12.6
Imputation
Missing items may occur in surveys for several reasons: an interviewer may fail to ask a question; a
respondent may refuse to answer the question or cannot provide the information; a clerk entering the
data may skip the value. Sometimes, items with responses are changed to missing when the data set
is edited or cleaned–a data editor may not be able to resolve the discrepancies for an individual 3-year-old who voted in the last election and may set both values to missing.
Imputation is commonly used to assign values to the missing items. A replacement value, often
from another person in the survey who is similar to the item nonrespondent on other variables, is
imputed for the missing value. When imputation is used, an additional variable that indicates
whether the response was measured or imputed should be created for the data set.
Imputation procedures are used not only to reduce the nonresponse bias but to produce a “clean,”
rectangular data set–one without holes for the missing values. We may want to look at tables for
subgroups of the population, and imputation allows us to do that without considering the item
nonresponse separately each time we construct a table. Some references for imputation include
Sande (1983) and Kalton and Kasprzyk (1982; 1986).
Example 12.7
The Current Population Survey (CPS) has an overall high household response rate (typically well
above 90%), but some households refuse to answer certain questions. The nonresponse rate is about
20% on many income questions. This nonresponse would create a substantial bias in any analysis
unless some corrective action were taken: various studies suggest that the item nonresponse for the
income items is highest for low-income and high-income households. Imputation for the missing
data makes it possible to use standard statistical techniques such as regression without the analyst
having to treat the nonresponse by using specially developed methods. For surveys such as the CPS,
if imputation is to be done, the agency collecting the data has more information to guide it in filling
the missing values than does an independent analyst, because identifying information is not released
on the public-use tapes.
The CPS uses weighting for noninterview adjustment and hot-deck imputation for item nonresponse.
The sample is divided into classes using variables sex, age, race, and other demographic
characteristics. If an item is missing, a corresponding item from another unit in that class is
substituted. Usually, hot-deck imputation is done by taking the value of the missing item from a
household that is similar to the household with the missing item in some other explanatory variable
such as family size.
We use the small data set in Table 12.3 to illustrate some of the different methods for imputation.
This artificial data set is only used for illustration; in practice, a much larger data set is needed for
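imputation. A “1” means the respondent answered yes to the question; a “?” marks a missing value.
Table 12.3
Small Data Set Used to Illustrate Imputation Methods
[The Age and Sex columns of this table are not legible in the transcription; the age group and sex of each person are given by the cells in Example 12.8.]

Person    Years of Education    Crime Victim?    Violent-Crime Victim?
1         16                    0                0
2         ?                     1                1
3         11                    0                0
4         ?                     1                1
5         12                    1                1
6         ?                     0                0
7         20                    1                ?
8         12                    0                0
9         13                    0                ?
10        10                    ?                ?
11        12                    0                0
12        12                    0                0
13        11                    1                ?
14        16                    1                0
15        14                    0                0
16        11                    0                0
17        14                    0                0
18        10                    0                0
19        12                    ?                0
20        10                    0                0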
12.6.1. Deductive Imputation
Some values may be imputed in the data editing, using logical relations among the variables. In
Table 12.3, person 9 is missing the response for whether she was a victim of violent crime. But she
had responded that she was not a victim of any crime, so the violent-crime response should be
changed to 0.
Deductive imputation may sometimes be used in longitudinal surveys. If a woman has two children
in year 1 and two children in year 3, but is missing the value for year 2, the logical value to impute
would be 2.
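The logical rule used for person 9 can be expressed as a small edit check. This is a sketch only; the function name is hypothetical and `None` is used here to mark a missing response.

```python
def deduce_violent_crime(crime_victim, violent_victim):
    """Fill in the violent-crime item when the any-crime item logically forces it.

    A person who was not a victim of any crime cannot have been a victim
    of violent crime, so a missing value is set to 0 in that case.
    """
    if violent_victim is None and crime_victim == 0:
        return 0
    return violent_victim   # otherwise leave the value (or the gap) as it is


print(deduce_violent_crime(0, None))   # person 9 in Table 12.3
print(deduce_violent_crime(1, None))   # cannot be deduced; stays missing
```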
12.6.2. Cell Mean Imputation
Respondents are divided into classes (cells) based on known variables, as in weighting-class
adjustments. Then, the average of the values for the responding units in cell c, $\bar{y}_{cR}$,
is substituted
for each missing value. Cell mean imputation assumes that missing items are missing completely at
random within the cells.
Example 12.8
The four cells for our example are constructed using the variables age and sex. (In practice, of
course, you would want to have many more individuals in each cell.)
                Age 34 or younger             Age 35 or older
Sex   M         Persons 3, 5, 10, 14          Persons 1, 7, 8, 15, 16
      F         Persons 4, 12, 13, 19, 20     Persons 2, 6, 9, 11, 17, 18
Persons 2 and 6, missing the value for years of education, would be assigned the mean value for the
four women aged 35 or older who responded to the question: 12.25. The mean for each cell after
imputation is the same as the mean of the respondents. The imputed value, however, is not one of
the possible responses to the question about education.
Mean imputation gives the same point estimates for means, totals, and proportions as the weighting-class adjustments. Mean imputation methods fail to reflect the variability of the nonrespondents,
however–all missing observations in a class are given the same imputed value. The distribution of y
will be distorted because of a “spike” at the value of the sample mean of the respondents. As a
consequence, the estimated variance in the subclass will be too small.
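The cell mean imputation for years of education in Example 12.8 can be checked with a short sketch. The dictionary below holds the education values from Table 12.3 for the six women aged 35 or older, with `None` marking a missing value; the function name is an illustrative assumption.

```python
# Years of education for the women aged 35 or older (persons 2, 6, 9, 11, 17, 18).
education = {2: None, 6: None, 9: 13, 11: 12, 17: 14, 18: 10}


def cell_mean_impute(values):
    """Replace each missing value in a cell by the mean of the cell's respondents."""
    observed = [v for v in values.values() if v is not None]
    cell_mean = sum(observed) / len(observed)
    return {person: (cell_mean if v is None else v)
            for person, v in values.items()}


imputed = cell_mean_impute(education)
print(imputed[2], imputed[6])
```

Both missing values receive the same imputed value, which is the source of the “spike” in the distribution described above.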
To avoid the spike, a stochastic cell mean imputation could be used. If the response variable were
approximately normally distributed, the missing values could be imputed with a randomly generated
value from a normal distribution with mean $\bar{y}_{cR}$
and standard deviation $s_{cR}$, the sample standard deviation of the respondents in cell c.
Mean imputation, stochastic or otherwise, distorts relationships among different variables because
imputation is done separately for each missing item. Sample correlations and other statistics are
changed. Jinn and Sedransk (1989a; 1989b) discuss the effect of different imputation methods on
secondary data analysis–for instance, for estimating a regression slope.
12.6.3. Hot-Deck Imputation
In hot-deck imputation, as in cell mean imputation and weighting-adjustment methods, the sample
units are divided into classes. The value of one of the responding units in the class is substituted for
each missing response. Often, the values for a set of related missing items are taken from the same
donor, to preserve some of the multivariate relationships. The name hot deck is from the days when
computer programs and data sets were punched on cards–the deck of cards containing the data set
being analyzed was warmed by the card reader, so the term hot deck was used to refer to imputations
made using the same data set. Fellegi and Holt (1976) discuss methods for data editing and hot-deck
imputation with large surveys.
How is the donor unit to be chosen? Several methods are possible.
Sequential Hot-Deck Imputation Some hot-deck imputation procedures impute the value in the
same subgroup that was last read by the computer. This is partly a carryover from the card days of
computers (imputation could be done in one pass) and partly a belief that, if the data are arranged in
some geographic order, adjacent units in the same subgroup will tend to be more similar than
randomly chosen units in the subgroup. One problem with using the value on the previous “card” is
that often nonrespondents also tend to occur in clusters, so one person may be a donor multiple
times, in a way that the sampler cannot control. One of the other hot-deck imputation methods is
usually used today for most surveys.
In our example, person 19 is missing the response for crime victimization. Person 13 had the last
response recorded in her subclass, so the value 1 is imputed.
Random Hot-Deck Imputation A donor is randomly chosen from the persons in the cell with
information on all missing items. To preserve multivariate relationships, usually values from the
same donor are used for all missing items of a person.
In our small data set, person 10 is missing both variables for victimization. Persons 3, 5, and 14 in
his cell have responses for both crime questions, so one of the three is chosen randomly as the donor.
In this case, person 14 is chosen, and his values are imputed for both missing variables.
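A random hot-deck step for this cell might look like the sketch below. The function and variable names are hypothetical; the data are the (crime victim, violent-crime victim) responses from Table 12.3 for the four persons in the cell, with `None` marking a missing item. The donor actually drawn depends on the random seed, so it need not be person 14 as in the text.

```python
import random

# (crime victim, violent-crime victim) for the cell containing persons 3, 5, 10, 14.
cell = {3: (0, 0), 5: (1, 1), 10: (None, None), 14: (1, 0)}


def random_hot_deck(cell, rng):
    """Fill every missing item of a person from one randomly chosen complete donor."""
    donors = [p for p, vals in cell.items() if all(v is not None for v in vals)]
    completed = dict(cell)
    donor_used = {}   # record which donor supplied values to which recipient
    for person, vals in cell.items():
        if any(v is None for v in vals):
            donor = rng.choice(donors)
            # The same donor supplies all missing items, preserving their relationship.
            completed[person] = tuple(d if v is None else v
                                      for v, d in zip(vals, cell[donor]))
            donor_used[person] = donor
    return completed, donor_used


completed, donor_used = random_hot_deck(cell, random.Random(0))
print(donor_used, completed[10])
```

Keeping the `donor_used` record follows the recommendation, made later in this chapter, to track which donor was used for each imputed response.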
Nearest-Neighbor Hot-Deck Imputation Define a distance measure between observations, and
impute the value of a respondent who is “closest” to the person with the missing item, where
closeness is defined using the distance function.
If age and sex are used for the distance function, so that the person of closest age with the same sex
is selected to be the donor, the victimization responses of person 3 will be imputed for person 10.
12.6.4. Regression Imputation
Regression imputation predicts the missing value by using a regression of the item of interest on
variables observed for all cases. A variation is stochastic regression imputation, in which the
missing value is replaced by the predicted value from the regression model, plus a randomly
generated error term.
We only have 18 complete observations for the response crime victimization (not really enough for
fitting a model to our data set), but a logistic regression of the response with explanatory variable age
gives a fitted model for the predicted probability of victimization, $\hat{p}$, of the form
$$\log\frac{\hat{p}}{1 - \hat{p}} = \hat{\beta}_0 + \hat{\beta}_1 \times \text{age}.$$
The predicted probability of being a crime victim for a 17-year-old is 0.74; because that is greater
than a predetermined cutoff of 0.5, the value 1 is imputed for person 10.
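The imputation step from a fitted logistic model can be sketched as follows. The intercept and slope below are illustrative assumptions, not values stated in this text; they were chosen only so that the predicted probability for a 17-year-old comes out near the 0.74 quoted above.

```python
import math

# Assumed coefficients for illustration only (not taken from the text).
INTERCEPT = 2.5643
SLOPE = -0.0896


def predicted_probability(age):
    """Predicted probability of victimization from the logistic model."""
    log_odds = INTERCEPT + SLOPE * age
    return 1.0 / (1.0 + math.exp(-log_odds))


def impute_victim(age, cutoff=0.5):
    """Impute 1 if the predicted probability exceeds the cutoff, otherwise 0."""
    return 1 if predicted_probability(age) > cutoff else 0


print(round(predicted_probability(17), 2), impute_victim(17))
```

A stochastic version would instead draw the imputed value as a Bernoulli variable with the predicted probability, rather than applying a fixed cutoff.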
Example 12.9
Paulin and Ferraro (1994) discuss regression models for imputing income in the U. S. Consumer
Expenditure Survey. Households selected for the interview component of the survey are interviewed
each quarter for five consecutive quarters; in each interview, they are asked to recall expenditures for
the previous 3 months. The data are used to relate consumer expenditures to characteristics such as
family size and income; they are the source of reports that expenditures exceed income in certain
income classes.
The Consumer Expenditure Survey conducts about 5000 interviews each year, as opposed to about
60,000 for the NCVS. This sample size is too small for hot-deck imputation methods, as it is less
likely that suitable donors will be found for nonrespondents in a smaller sample. If imputation is to
be done at all, a parametric model needs to be adopted. Paulin and Ferraro used multiple regression
models to predict the log of family income (logarithms are used because the distribution of income is
skewed) from explanatory variables including total expenditures and demographic variables. These
models assume that income items are MAR, given the covariates.
12.6.5. Cold-Deck Imputation
In cold-deck imputation, the imputed values are from a previous survey or other information, such as
from historical data. (Since the data set serving as the source for the imputation is not the one
currently running through the computer, the deck is “cold.”) Little theory exists for the method. As
with hot-deck imputation, cold-deck imputation is not guaranteed to eliminate selection bias.
12.6.6. Substitution
Substitution methods are similar to cold-deck imputation. Sometimes interviewers are allowed to
choose a substitute while in the field; if the household selected for the sample is not at home, they try
next door. Substitution may help reduce some nonresponse bias, as the household next door may be
more similar to the nonresponding household than would be a household selected at random from the
population. But the household next door is still a respondent; if the nonresponse is related to the
characteristics of interest, there will still be nonresponse bias. An additional problem is that, since
the interviewer is given discretion about which household to choose, the sample no longer has
known probabilities of selection.
The 1975 Michigan Survey of Substance Abuse was taken to estimate the number of persons that
used 16 types of substances in the previous year. The sample design was a stratified multistage
sample with 2100 households. Three calls were made at a dwelling; then the house to the right was
tried, then the house to the left. From the data, evidence shows that the substance-use rate increases
as the required number of calls increases.
Some surveys select designated substitutes at the same time the sample units are selected. If a unit
does not respond, then one of the designated substitutes is randomly selected. The National
Longitudinal Study (see National Center of Educational Statistics 1977) used this method. This
stratified, multistage sample of the high school graduating class of 1972 was intended to provide data
on the educational experiences, plans, and attitudes of high school seniors. Four high schools were
randomly selected from each of 600 strata. Two were designated for the sample, and the other two
were saved as backups in case of nonresponse. Of the 1200 schools designated for the sample, 948
participated, 21 had no graduating seniors, and 231 either refused or were unable to participate.
Investigators chose 122 schools from the backup group to substitute for the nonresponding schools.
Follow-up studies showed a consistent 5% bias in a number of estimated totals, which was attributed
to the use of substitute schools and to nonresponse.
Substitution has the added danger that efforts to contact the designated units may not be as great as if
no “easy way out” was provided. If substitution is used, it should be reported in the results.
12.6.7. Multiple Imputation
In multiple imputation, each missing value is imputed m (≥ 2) different times. Typically, the same
stochastic model is used for each imputation. These create m different “data” sets with no missing
values. Each of the m data sets is analyzed as if no imputation had been done; the different results
give the analyst a measure of the additional variance due to the imputation. Multiple imputation
with different models for nonresponse can give an idea of the sensitivity of the results to particular
nonresponse models. See Rubin (1987; 1996) for details on implementing multiple imputation.
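One common way to combine the m analyses is with the combining rules usually attributed to Rubin (1987): the point estimate is the average of the m estimates, and the variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance. The sketch below assumes these rules; the function name and the three sets of results are hypothetical.

```python
def combine_imputations(estimates, variances):
    """Combine m completed-data estimates and their estimated variances.

    Returns the combined point estimate and its total variance, where the
    between-imputation component reflects the extra uncertainty due to imputation.
    """
    m = len(estimates)
    combined = sum(estimates) / m
    within = sum(variances) / m
    between = sum((q - combined) ** 2 for q in estimates) / (m - 1)
    total = within + (1 + 1 / m) * between
    return combined, total


# Hypothetical results from m = 3 imputed data sets.
estimate, total_variance = combine_imputations([10.0, 12.0, 11.0], [1.0, 1.0, 1.0])
print(estimate, total_variance)
```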
12.6.8. Advantages and Disadvantages of Imputation
Imputation creates a “clean,” rectangular data set that can be analyzed by standard software.
Analyses of different subsets of the data will produce consistent results. If the nonresponse is
missing at random given the covariates used in the imputation procedure, imputation substantially
reduces the bias due to item nonresponse. If parts of the data are confidential, the data collector can
perform the imputation. The data collector has more information about the sample and population
than is released to the public (for example, the collector may know the exact address for each sample
member) and can often perform a better imputation using that information.
The foremost danger of using imputation is that future data analysis will not distinguish between the
original and the imputed values. Ideally, the imputer should record which observations are imputed,
how many times each nonimputed record is used as a donor, and which donor was used for a specific
response imputed to a recipient. The imputed values may be good guesses, but they are not real data.
Variances computed using the data together with the imputed values are always too small, partly
because of the artificial increase in the sample size and partly because the imputed values are treated
as though they were really obtained in the data collection. The true variance will be larger than that
estimated from a standard software package. Rao (1996) and Fay (1996) discuss methods for
estimating the variances after imputation.
12.7
Parametric Models for Nonresponse
Most of the methods for dealing with nonresponse assume that the nonresponse is ignorable–that is,
conditionally on measured covariates, nonresponse is independent of the variables of interest. In this
situation, rather than simply dividing units among different subclasses and adjusting weights, one
can fit a superpopulation model. From the model, then, one predicts the values of the y’s not in the
sample. The model fitting is often iterative.
In a completely model-based approach, we develop a model for the complete data and add
components to the model to account for the proposed nonresponse mechanism. Such an approach
has many advantages over other methods: the modeling approach is flexible and can be used to
include any knowledge about the nonresponse mechanism, the modeler is forced to state the
assumptions about nonresponse explicitly in the model, and some of these assumptions can be
evaluated. In addition, variance estimates that result from fitting the model account for the
nonresponse, if the model is a good one.
Example 12.10
Many people believe that spotted owls in Washington, Oregon, and California are threatened with
extinction because timber harvesting in mature coniferous forests reduces their available habitat.
Good estimates of the size of the spotted owl population are needed for reasoned debate on the issue.
In the sampling plan described by Azuma et al. (1990), a region of interest is divided into N
sampling regions (PSU’s), and an SRS of n PSU’s is selected. Let
$$Y_i = \begin{cases} 1 & \text{if PSU } i \text{ is occupied,} \\ 0 & \text{otherwise.} \end{cases}$$
Assume that the Yi’s are independent and that P(Yi = 1) = p, the true proportion of occupied PSU’s.
If occupancy could be definitively determined for each PSU, the proportion of PSU’s occupied could
be estimated by the sample proportion
$$\bar{y} = \frac{1}{n}\sum_{i \in S} Y_i.$$
While a fixed number of visits can establish that a PSU is
occupied, however, a determination that a PSU is unoccupied may be wrong–some owl pairs are
“nonrespondents,” and ignoring the nonresponse will likely result in a too-low estimate of percentage
occupancy.
Azuma et al. (1990) propose using a geometric distribution for the number of visits required to
discover the owls in an occupied unit, thus modeling the nonresponse. The assumptions for the
model are:
(1)
the probability of determining occupancy on the first visit, η, is the same for all PSU’s,
(2)
each visit to a PSU is independent, and
(3)
visits can continue until an owl is sighted.
A geometric distribution is commonly used for number of callbacks needed in surveys of people (see
Potthoff et al. 1993).
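Let Xi be the number of visits required to determine whether PSU i is occupied or not. Under the
geometric model, with η denoting the first-visit detection probability of assumption (1),
$$P(X_i = x \mid Y_i = 1) = \eta(1-\eta)^{x-1}, \qquad x = 1, 2, 3, \ldots$$
The budget of the U. S. Forest Service, however, does not allow for an infinite number of visits.
Suppose a maximum of s visits are to be made to each PSU. The random variable Yi cannot be
observed; the observable random variables are
$$V_i = \begin{cases} 1 & \text{if } Y_i = 1 \text{ and } X_i \le s, \\ 0 & \text{otherwise,} \end{cases} \qquad U_i = \begin{cases} X_i & \text{if } V_i = 1, \\ 0 & \text{otherwise.} \end{cases}$$
Here, $\sum_{i \in S} V_i$ counts the number of PSU’s observed to be occupied, and $\sum_{i \in S} U_i$ counts the total
number of visits made to occupied units. Using the geometric model, the probability that an owl is
first observed in PSU i on visit k (k ≤ s) is
$$P(U_i = k) = \eta(1-\eta)^{k-1}\, p,$$
and the probability that an owl is observed on one of the s visits to PSU i is
$$P(V_i = 1) = E[V_i] = \left[1 - (1-\eta)^s\right] p.$$
Thus, the expected value of the sample proportion of occupied units, $\bar{v} = \sum_{i \in S} V_i / n$, is
$$E[\bar{v}] = \left[1 - (1-\eta)^s\right] p,$$
and is less than the proportion of interest p if η < 1. The geometric model agrees with the intuition that
owls are missed in the s visits.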
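We find the maximum likelihood estimates of p and η under the assumption that all PSU’s are
independent. Write $v_i$ for the indicator that owls were observed in PSU i within the s visits and $u_i$
for the visit on which they were first observed (0 if never observed), with sample means $\bar{v}$ and $\bar{u}$.
The likelihood function
$$L(p, \eta) = \prod_{i \in S} \left[p\,\eta(1-\eta)^{u_i - 1}\right]^{v_i} \left\{1 - p\left[1 - (1-\eta)^s\right]\right\}^{1 - v_i}$$
is maximized when
$$\hat{p} = \frac{\bar{v}}{1 - (1-\hat{\eta})^s}$$
and when $\hat{\eta}$ solves
$$\frac{\bar{u}}{\bar{v}} = \frac{1}{\eta} - \frac{s(1-\eta)^s}{1 - (1-\eta)^s};$$
numerical methods are needed to calculate $\hat{\eta}$. Maximum likelihood theory also allows calculation
of the asymptotic covariance matrix of the parameter estimates.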
An SRS of 240 habitat PSU’s in California had the following results:

Visit number                 1    2    3    4    5    6
Number of occupied PSU’s    33   17   12    7    7    5
A total of 81 PSU’s were observed to be occupied in six visits, so $\bar{v} = 81/240 = 0.3375$. The
average number of visits made to occupied units was $\bar{u}/\bar{v} = 196/81 = 2.42$. Thus, the maximum
likelihood estimates are $\hat{\eta} = 0.334$ and $\hat{p} = 0.370$; using the asymptotic covariance matrix from
maximum likelihood theory, we estimate the variance of $\hat{p}$ by 0.00137. Thus, an approximate 95%
confidence interval for the proportion of units that are occupied is 0.370±0.072.
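The numerical step can be sketched in a few lines of Python. This is only an illustration, not the computation used by Azuma et al.: it assumes the estimating equation for η sets the mean of a geometric distribution truncated at s visits, 1/η − s(1−η)^s / [1 − (1−η)^s], equal to the observed average number of visits to occupied units, and it solves that equation by bisection. The function names are hypothetical.

```python
# Sketch: ML estimates for the truncated-geometric occupancy model (owl data above).
counts = [33, 17, 12, 7, 7, 5]   # PSUs first found occupied on visits 1..6
n, s = 240, len(counts)          # sample size and maximum number of visits

def truncated_mean(eta, s):
    """Mean visit of first detection, given detection within s visits."""
    q = (1.0 - eta) ** s
    return 1.0 / eta - s * q / (1.0 - q)

def solve_eta(target, s, lo=1e-6, hi=1.0 - 1e-6, tol=1e-10):
    """Bisection; truncated_mean decreases from (s + 1) / 2 to 1 as eta goes 0 -> 1."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if truncated_mean(mid, s) > target:
            lo = mid   # mean still too large, so eta must be larger
        else:
            hi = mid
    return 0.5 * (lo + hi)

n_occupied = sum(counts)                                      # 81
total_visits = sum(k * c for k, c in enumerate(counts, 1))    # 196
v_bar = n_occupied / n
eta_hat = solve_eta(total_visits / n_occupied, s)
p_hat = v_bar / (1.0 - (1.0 - eta_hat) ** s)
print(round(eta_hat, 3), round(p_hat, 3))
```

The results should land close to the estimates quoted in the text. Bisection is used only because it needs no derivatives; any one-dimensional root finder would do.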
Incorporating the geometric model for number of visits gave a larger estimate of the proportion of
units occupied. If the model does not describe the data, however, the estimate $\hat{p}$ will still be biased;
if the model is poor, $\hat{p}$ may be a worse estimate of the occupancy rate than the naive sample
proportion $\bar{v}$. If, for example, field
investigators were more likely to find owls on later visits because they accumulate additional
information on where to look, the geometric model would be inappropriate.
We need to check whether the geometric model adequately describes the number of visits needed to
determine occupancy. Unfortunately, we cannot determine whether the model would describe the
situation for units in which owls are not detected in six visits, as the data are missing. We can,
however, use a $\chi^2$ goodness-of-fit test to see whether data from the six visits made are fit by the
model. Under the model, we expect $n\,p\,\eta(1-\eta)^{k-1}$ of the PSU’s to have owls observed on visit
k, and we plug in our estimates of p and η to calculate expected counts:
Visit     Observed count   Expected count
1               33              29.66
2               17              19.74
3               12              13.14
4                7               8.75
5-6             12               9.71
Total           81              80.99
Visits 5 and 6 were combined into one category so that the expected cell count would be greater than
5. The $\chi^2$ test statistic is 1.75, with p-value > 0.05. There is no indication that the model is
inadequate for the data we have. We cannot check its adequacy for the missing data, however. The
geometric model assumes observations are independent and that an occupied PSU would eventually
be determined to be occupied if enough visits were made. We cannot check whether that assumption
of the model is reasonable or not: if some wily owls will never be detected in any number of visits,
$\hat{p}$ will still be too small.
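The goodness-of-fit calculation can be checked with a short Python sketch. It assumes the expected count for visit k is n·p·η(1−η)^(k−1), plugs in the rounded estimates 0.370 and 0.334 (so the expected counts differ slightly in the second decimal from the table), and uses 5 − 1 − 2 = 2 degrees of freedom because two parameters were estimated. For 2 degrees of freedom the chi-square upper tail probability has the closed form exp(−x/2), so no statistical library is needed.

```python
import math

observed = [33, 17, 12, 7, 12]          # visits 1, 2, 3, 4, and 5-6 combined
n, p_hat, eta_hat = 240, 0.370, 0.334   # rounded estimates from the example

def expected_count(k):
    """Expected number of PSUs with owls first observed on visit k."""
    return n * p_hat * eta_hat * (1.0 - eta_hat) ** (k - 1)

expected = [expected_count(k) for k in (1, 2, 3, 4)]
expected.append(expected_count(5) + expected_count(6))   # combined 5-6 cell

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1 - 2              # cells - 1 - number of estimated parameters
p_value = math.exp(-chi2 / 2.0)         # exact chi-square upper tail when df == 2

print(round(chi2, 2), df, round(p_value, 2))
```

A p-value well above 0.05 is what supports the statement that there is no indication of lack of fit for the observed visits.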
To use models with nonresponse, you need
(1)
a thorough knowledge of mathematical statistics,
(2)
a powerful computer, and
(3)
knowledge of numerical methods for optimization.
Commonly, maximum likelihood methods are used to estimate parameters, and the likelihood
equations rarely have closed-form solutions. Calculation of estimates required numerical methods
even for the simple model adopted for the owls, and that was an SRS with a simple geometric model
for the response mechanism that allowed us to easily write down the likelihood function. Likelihood
functions for more complex sampling designs or nonresponse mechanisms are much more difficult
to construct (particularly if observations in the same cluster are considered dependent), and
calculating estimates often requires intensive computations. Little and Rubin (1987) discuss
likelihood-based methods for missing data in general. Stasny (1991) gives an example of using
models to account for nonresponse.
12.8
What is an Acceptable Response Rate?
Often an investigator will say, “I expect to get a 60% response rate in my survey. Is that acceptable
and will the survey give me valid results?” As we have seen in this chapter, the answer to that
question depends on the nature of the nonresponse: if the nonrespondents are MCAR, then we can
largely ignore the nonresponse and use the respondents as a representative sample of the population.
If the nonrespondents tend to differ from the respondents, then the biases in the results from using
only the respondents may make the entire survey worthless.
Many references give advice on cutoffs for acceptability of response rates. Babbie, for example,
says: “I feel that a response rate of at least 50 percent is adequate for analysis and reporting. A
response of at least 60 percent is good. And a response rate of 70 percent is very good” (1973, 165).
I believe that giving such absolute guidelines for acceptable response rates is dangerous and has led
many survey investigators to unfounded complacency about nonresponse; many examples exist of
surveys with a 70% response rate whose results are flawed. The NCVS needs corrections for
nonresponse bias even with a response rate of about 95%.
Be aware that response rates can be manipulated by defining them differently. Researchers often do
not say how the response rate was calculated or may use an estimate of response rate that is smaller
than it should be. Many surveys inflate the response rate by eliminating units that could not be
located from the denominator. Very different results for response rate accrue, depending on which
definition of response rate is used; all of the following have been used in surveys:
(number of completed interviews) / (number of units in sample)
(number of completed interviews) / (number of units contacted)
(completed interviews + ineligible units) / (contacted units)
(completed interviews) / (contacted units - ineligible units)
(completed interviews) / (contacted units - ineligible units - refusals)
Note that a “response rate” calculated using the last formula will be much higher than one calculated
using the first formula because the denominator is smaller.
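To see how far apart these definitions can be, the sketch below applies five ratios to a single set of disposition counts. The counts are invented for illustration, and the five ratios are my reading of the list above: completed over sample size, completed over contacted, completed plus ineligible over contacted, completed over contacted minus ineligible, and completed over contacted minus ineligible minus refusals.

```python
# Hypothetical disposition counts for one survey (not taken from the text)
sample_size = 1000
contacted = 800
ineligible = 50
refusals = 150
completed = 500

rates = {
    "completed / units in sample": completed / sample_size,
    "completed / units contacted": completed / contacted,
    "(completed + ineligible) / contacted": (completed + ineligible) / contacted,
    "completed / (contacted - ineligible)": completed / (contacted - ineligible),
    "completed / (contacted - ineligible - refusals)":
        completed / (contacted - ineligible - refusals),
}

for name, rate in rates.items():
    print(f"{name}: {rate:.1%}")
```

With these made-up counts the same fieldwork yields reported rates ranging from 50% under the first definition to more than 80% under the last, which is why the definition has to be stated alongside the number.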
The guidelines for reporting response rates in Statistics Canada (1993) and Hidiroglou et al. (1993)
provide a sensible solution for reporting response rates. They define in-scope units as those that
belong to the target population, and resolved units as those units for which it is known whether or
not they belong to the target population.3 They suggest reporting a number of different response
rates for a survey including the following:
•
Out-of-scope rate: the ratio of the number of out-of-scope units to the number of resolved
units
•
No-contact rate: the ratio of the number of no-contacts and unresolved units to the number of
in-scope and unresolved units
•
Refusal rate: the ratio of number of refusals to the number of in-scope units
•
Nonresponse rate: the ratio of number of nonrespondent and unresolved units to the number
of in-scope and unresolved units
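The four rates just listed can be computed together from one table of unit dispositions. The sketch below is an illustration with invented counts; it assumes that resolved units are the in-scope plus out-of-scope units, and that in-scope units split into respondents, refusals, no-contacts, and other nonrespondents. The function name and argument names are hypothetical.

```python
def reporting_rates(respondents, refusals, no_contacts, other_nonresponse,
                    out_of_scope, unresolved):
    """Rates as defined in the list above; every argument is a count of units."""
    in_scope = respondents + refusals + no_contacts + other_nonresponse
    resolved = in_scope + out_of_scope            # scope status is known
    nonrespondents = refusals + no_contacts + other_nonresponse
    return {
        "out_of_scope": out_of_scope / resolved,
        "no_contact": (no_contacts + unresolved) / (in_scope + unresolved),
        "refusal": refusals / in_scope,
        "nonresponse": (nonrespondents + unresolved) / (in_scope + unresolved),
    }

# Hypothetical survey: 1000 sampled units
rates = reporting_rates(respondents=600, refusals=120, no_contacts=60,
                        other_nonresponse=20, out_of_scope=100, unresolved=100)
for name, value in rates.items():
    print(f"{name} rate: {value:.3f}")
```

Note that unresolved units are counted against the survey in the no-contact and nonresponse rates, so they cannot be used to shrink the denominator.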
Different measures of response rates may be appropriate for different surveys, and I hesitate to
recommend one “fits-all” definition of response rate. The quantities used in calculating response
rate, however, should be defined for every survey. The following recommendations from the U. S.
Office of Management and Budget’s Federal Committee on Statistical Methodology, reported in
Gonzales et al. (1994), are helpful:
Recommendation 1. Survey staffs should compute response rates in a uniform fashion over time and
document response rate components on each edition of a survey.
Recommendation 2. Survey staffs for repeated surveys should monitor response rate components
(such as refusals, not-at-homes, out-of-scopes, address not locatable, post-master returns, etc.) over
time, in conjunction with routine documentation of cost and design changes.
3 If, for example, the target population is residential telephone numbers, it may be impossible to tell whether
or not a telephone that rings but is not answered belongs to the target population; such a number would be an
unresolved unit.
Recommendation 3. Response rate components should be published in survey reports, readers
should be given definitions of response rates used, including actual counts, and commentary on the
relevance of response rates to the quality of the survey data.
Recommendation 4. Some research on nonresponse can have real payoffs. It should be encouraged
by survey administrators as a way to improve the effectiveness of data collection operations.
Annex 1
Many surveys have more than one of these problems. The Literary Digest (1932, 1936a, b, c) began
taking polls to forecast the outcome of the U. S. presidential election in 1912, and their polls attained
a reputation for accuracy because they forecast the correct winner in every election between 1912
and 1932. In 1932, for example, the poll predicted that Roosevelt would receive 56% of the popular
vote and 474 votes in the electoral college; in the actual election, Roosevelt received 58% of the
popular vote and 472 votes in the electoral college.
With such a strong record of accuracy, it is not surprising that the editors of the Literary Digest had a
great deal of confidence in their polling methods by 1936. Launching the 1936 poll, they said:
The Poll represents thirty years’ constant evolution and perfection. Based on the “commercial
sampling” methods used for more than a century by publishing houses to push book sales, the
present mailing list is drawn from every telephone book in the United States, from the rosters of clubs
and associations, from city directories, lists of registered voters, classified mail-order and occupational
data. (1936a, 3).
On October 31, the poll predicted that Republican Alf Landon would receive 55% of the popular
vote, compared with 41% for President Roosevelt. The article “Landon, 1,293,669; Roosevelt,
972,897: Final Returns in The Digest’s Poll of Ten Million Voters” contained the statement: “We
make no claim to infallibility. We did not coin the phrase ‘uncanny accuracy’ which has been so
freely applied to our Polls” (1936b). It is a good thing they made no claim to infallibility; in the
election, Roosevelt received 61% of the vote; Landon, 37%.
What went wrong? One problem may have been the undercoverage in the sampling frame, which
relied heavily on telephone directories and automobile registration lists–the frame was used for
advertising purposes, as well as for the poll. Households with a telephone or automobile in 1936
were generally more affluent than other households, and opinion of Roosevelt’s economic policies
was generally related to the economic class of the respondent. But sampling frame bias does not
explain all the discrepancy. Postmortem analyses of the poll by Squire (1988) and Calahan (1989)
indicate that even persons with both a car and a telephone tended to favor Roosevelt, though not to
the degree that persons with neither car nor telephone supported him.
The low response rate to the survey was likely the source of much of the error. Ten million
questionnaires were mailed out, and 2.3 million were returned–an enormous sample but a response
rate of less than 25%. In Allentown, Pennsylvania, for example, the survey was mailed to every
registered voter, but the survey results for Allentown were still incorrect because only one-third of
the ballots were returned. Squire (1988) reports that persons supporting Landon were much more
likely to have returned the survey; in fact, many Roosevelt supporters did not even remember
receiving a survey, even though they were on the mailing list.
One lesson to be learned from the Literary Digest poll is that the sheer size of a sample is no
guarantee of its accuracy. The Digest editors became complacent because they sent out
questionnaires to more than one quarter of all registered voters and obtained a huge sample of 2.3
million people. But large unrepresentative samples can perform as badly as small unrepresentative
samples. A large unrepresentative sample may do more damage than a small one because many
people think that large samples are always better than small ones. The design of the survey is far
more important than the absolute size of the sample.
What good are samples with selection bias? We prefer to have samples with no selection bias,
that serve as a microcosm of the population. When the primary interest is in estimating the total
number of victims of violent crime in the United States or the percentage of likely voters in the
United Kingdom who intend to vote for the Labour Party in the next election, serious selection bias
can cause the sample estimates to be invalid.
Purposive or judgment samples can provide valuable information, though, particularly in the early
stages of an investigation. Teichman et al. (1993) took soil samples along Interstate 880 in Alameda
County, California, to determine the amount of lead in yards of homes and in parks close to the
freeway. In taking the samples, they concentrated on areas where they thought children were likely
to play and areas where soil might easily be tracked into homes. The purposive sampling scheme
worked well for justifying the conclusion of the study, that “lead contamination of urban soil in the
east bay area of the San Francisco metropolitan area is high and exceeds hazardous waste levels at
many sites.” A sampling scheme that avoided selection bias would only be needed for this study if
the investigators wanted to generalize the estimated percentage of contaminated sites to the entire
area.
Annex 2
Shere Hite’s book Women and Love: A Cultural Revolution in Progress (1987) had a number of
widely quoted results:
•
84% of women are “not satisfied emotionally with their relationships” (p. 804).
•
70% of all women “married five or more years are having sex outside of their marriages” (p.
856).
•
95% of women “report forms of emotional and psychological harassment from men with
whom they are in love relationships” (p. 810).
•
84% of women report forms of condescension from the men in their love relationships (p.
809).
The book was widely criticized in newspaper and magazine articles throughout the United States.
The Time magazine cover story “Back Off, Buddy” (October 12, 1987), for example, called the
conclusions of Hite’s study “dubious” and “of limited value.”
Why was Hite’s study so roundly criticized? Was it wrong for Hite to report the quotes from women
who feel that the men in their lives refuse to treat them as equals, who perhaps have never been
given the chance to speak out before? Was it wrong to report the percentages of these women who
are unhappy in their relationships with men?
Of course not. Hite’s research allowed women to discuss how they viewed their experiences, and
reflected the richness of these women’s experiences in a way that a multiple-choice questionnaire
could not. Hite’s error was in generalizing these results to all women, whether they participated in
the survey or not, and in claiming that the percentages applied to all women. The following
characteristics of the survey make it unsuitable for generalizing the results to all women.
•
The sample was self-selected–that is, recipients of questionnaires decided whether they
would be in the sample or not. Hite mailed 100,000 questionnaires; of these, 4.5% were
returned.
•
The questionnaires were mailed to such organizations as professional women’s groups,
counseling centers, church societies, and senior citizens’ centers. The members may differ in
political views, but many have joined an “all-women” group, and their viewpoints may differ
from other women in the United States.
•
The survey has 127 essay questions, and most of the questions have several parts. Who will
tend to return the survey?
•
Many of the questions are vague, using words such as love. The concept of love probably has
as many interpretations as there are people, making it impossible to attach a single
interpretation to any statistic purporting to state how many women are “in love.” Such
question wording works well for eliciting the rich individual vignettes that comprise most of
the book but makes interpreting percentages difficult.
•
Many of the questions are leading–they suggest to the respondent which response she should
make. For instance: “Does your husband/lover see you as an equal? Or are there times when
he seems to treat you as an inferior? Leave you out of the decisions? Act superior?” (p.
795).
Hite writes, “Does research that is not based on a probability or random sample give one the right to
generalize from the results of the study to the population at large? If a study is large enough and the
sample broad enough, and if one generalizes carefully, yes” (p. 778). Most survey statisticians
would answer Hite’s questions with a resounding no. In Hite’s survey, because the women sent
questionnaires were purposefully chosen and an extremely small percentage of the women returned
the questionnaires, statistics calculated from these data cannot be used to indicate attitudes of all
women in the United States. The final sample is not representative of women in the United States,
and the statistics can only be used to describe women who would have responded to the survey.
Hite claims that results from the sample could be generalized because characteristics such as the age,
educational, and occupational profiles of women in the sample matched those for the population of
women in the United States. But the women in the sample differed on one important aspect–they
were willing to take the time to fill out a long questionnaire dealing with harassment by men and to
provide intensely personal information to a researcher. We would expect that in every age group and
socioeconomic class, women who choose to report such information would in general have had
different experiences than women who choose not to participate in the survey.
Annex 3
2.7
Randomization Theory Results for Simple Random Sampling
As we have seen before, $\bar{y}$ is an unbiased estimator of $\bar{y}_U$, where the latter is the average of all
possible values of $\bar{y}_S$ if we could examine all possible SRSs S that could be chosen. We also
calculated the variance of $\bar{y}$, given by
$$V(\bar{y}) = \left(1 - \frac{n}{N}\right)\frac{S^2}{n},$$
which can be estimated by the unbiased estimator
$$\hat{V}(\bar{y}) = \left(1 - \frac{n}{N}\right)\frac{s^2}{n}.$$
No distributional assumptions are made about the yi’s in order to ascertain that $\bar{y}$ is unbiased for
estimating $\bar{y}_U$. We do not, for instance, assume that the yi’s are normally distributed with mean μ.
In the randomization theory (also called design-based) approach to sampling, the yi’s are
considered to be fixed but unknown numbers–any probabilities used arise from the probabilities of
selecting units to be in the sample. The randomization theory approach provides a nonparametric
approach to inference–we need not make any assumptions about the distribution of random
variables.
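Let’s see how the randomization theory works for deriving properties of the sample mean in simple
random sampling. As done in Cornfield (1944), define
$$Z_i = \begin{cases} 1 & \text{if unit } i \text{ is in the sample,} \\ 0 & \text{otherwise.} \end{cases}$$
Then
$$\bar{y} = \sum_{i \in S} \frac{y_i}{n} = \sum_{i=1}^{N} Z_i \frac{y_i}{n}.$$
The Zi’s are the only random variables in the above equation because, according to randomization
theory, the yi’s are fixed quantities. When we choose an SRS of n units out of the N units in the
population, {Z1, . . . , ZN} are identically distributed Bernoulli random variables with
$$\pi_i = P(Z_i = 1) = P(\text{unit } i \text{ is selected in the sample}) = \frac{n}{N}. \tag{2.18}$$
The probability in (2.18) follows from the definition of an SRS. To see this, note that if unit i is in
the sample, then the other (n – 1) units in the sample must be chosen from the other (N - 1) units in the
population.
A total of $\binom{N-1}{n-1}$ possible samples of size (n - 1) may be drawn from a population of size (N - 1),
so
$$P(Z_i = 1) = \frac{\text{number of samples including unit } i}{\text{number of possible samples}} = \frac{\binom{N-1}{n-1}}{\binom{N}{n}} = \frac{n}{N}.$$
As a consequence of Equation (2.18),
$$E[Z_i] = E[Z_i^2] = \frac{n}{N}$$
and
$$E[\bar{y}] = E\left[\sum_{i=1}^{N} Z_i \frac{y_i}{n}\right] = \sum_{i=1}^{N} \frac{n}{N}\,\frac{y_i}{n} = \sum_{i=1}^{N} \frac{y_i}{N} = \bar{y}_U.$$
The variance of $\bar{y}$ is also calculated using properties of the random variables Z1, . . . , ZN. Note that
$$V(Z_i) = E[Z_i^2] - (E[Z_i])^2 = \frac{n}{N} - \left(\frac{n}{N}\right)^2 = \frac{n}{N}\left(1 - \frac{n}{N}\right).$$
For $i \ne j$,
$$E[Z_i Z_j] = P(Z_i = 1 \text{ and } Z_j = 1) = P(Z_j = 1 \mid Z_i = 1)\,P(Z_i = 1) = \frac{n-1}{N-1}\cdot\frac{n}{N}.$$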
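Because the population is finite, the Zi’s are not quite independent–if we know that unit i is in the
sample, we do have a small amount of information about whether unit j is in the sample, reflected in
the conditional probability P(Zj = 1 | Zi = 1). Consequently, for $i \ne j$,
$$\mathrm{Cov}(Z_i, Z_j) = E[Z_i Z_j] - E[Z_i]E[Z_j] = \frac{n-1}{N-1}\cdot\frac{n}{N} - \left(\frac{n}{N}\right)^2 = -\frac{1}{N-1}\left(1 - \frac{n}{N}\right)\frac{n}{N}.$$
We use the covariance (Cov) of Zi and Zj to calculate the variance of $\bar{y}$; see Appendix B for
properties of covariances. The negative covariance of Zi and Zj is the source of the fpc.
$$\begin{aligned}
V(\bar{y}) &= \frac{1}{n^2} V\left(\sum_{i=1}^{N} Z_i y_i\right)
= \frac{1}{n^2}\left[\sum_{i=1}^{N} y_i^2 V(Z_i) + \sum_{i=1}^{N}\sum_{j \ne i} y_i y_j \mathrm{Cov}(Z_i, Z_j)\right] \\
&= \frac{1}{n^2}\,\frac{n}{N}\left(1 - \frac{n}{N}\right)\left[\sum_{i=1}^{N} y_i^2 - \frac{1}{N-1}\sum_{i=1}^{N}\sum_{j \ne i} y_i y_j\right] \\
&= \frac{1}{n^2}\,\frac{n}{N}\left(1 - \frac{n}{N}\right)\frac{1}{N-1}\left[N\sum_{i=1}^{N} y_i^2 - \left(\sum_{i=1}^{N} y_i\right)^2\right]
= \left(1 - \frac{n}{N}\right)\frac{S^2}{n}.
\end{aligned}$$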
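To show that the estimator $\hat{V}(\bar{y}) = (1 - n/N)\,s^2/n$ of $V(\bar{y})$ is an unbiased estimator,
we need to show that E[s²] = S². The argument proceeds much like
the previous one.
Since $S^2 = \sum_{i=1}^{N} (y_i - \bar{y}_U)^2/(N-1)$, it makes sense when trying to find an unbiased estimator to find the
expected value of $\sum_{i \in S} (y_i - \bar{y})^2$ and then find the multiplicative constant that will give the
unbiasedness:
$$\begin{aligned}
E\left[\sum_{i \in S} (y_i - \bar{y})^2\right]
&= E\left[\sum_{i \in S} \{(y_i - \bar{y}_U) - (\bar{y} - \bar{y}_U)\}^2\right]
= E\left[\sum_{i \in S} (y_i - \bar{y}_U)^2 - n(\bar{y} - \bar{y}_U)^2\right] \\
&= E\left[\sum_{i=1}^{N} Z_i (y_i - \bar{y}_U)^2\right] - n V(\bar{y})
= \frac{n}{N}\sum_{i=1}^{N} (y_i - \bar{y}_U)^2 - \left(1 - \frac{n}{N}\right) S^2 \\
&= \frac{n(N-1)}{N} S^2 - \frac{N-n}{N} S^2 = (n-1) S^2.
\end{aligned}$$
Thus,
$$E\left[\frac{1}{n-1}\sum_{i \in S} (y_i - \bar{y})^2\right] = E[s^2] = S^2.$$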
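The randomization results of this section can be verified by brute force on a tiny population, since "expected value" here just means an average over all equally likely SRSs. The sketch below uses a made-up population of six values and enumerates every sample of size 3; it checks that the average of the sample means is the population mean, that the variance of the sample means over all samples equals (1 − n/N)S²/n, and that the average of s² is S².

```python
from itertools import combinations
from statistics import mean, variance, pvariance

y = [2, 5, 7, 11, 13, 20]        # hypothetical finite population (fixed values)
N, n = len(y), 3

samples = list(combinations(y, n))           # all C(6, 3) = 20 possible SRSs
ybars = [mean(s) for s in samples]           # sample mean for each SRS
s2s = [variance(s) for s in samples]         # sample variance (divisor n - 1)

S2 = variance(y)                             # population S^2 (divisor N - 1)
design_var = pvariance(ybars)                # variance over the 20 equally likely samples

print(mean(ybars), mean(y))                  # unbiasedness of the sample mean
print(design_var, (1 - n / N) * S2 / n)      # variance formula with the fpc
print(mean(s2s), S2)                         # unbiasedness of s^2
```

No distribution is assumed for the y values; the only randomness is which of the 20 samples is drawn.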
2.8
A Model for Simple Random Sampling
Unless you have studied randomization theory in the design of experiments, the proofs in the
preceding section probably seemed strange to you. The random variables in randomization theory
are not concerned with the responses yi: they are simply random variables that tell us whether the ith
unit is in the sample or not. In a design-based, or randomization theory, approach to sampling
inference, the only relationship between units sampled and units not sampled is that the nonsampled
units could have been sampled had we used a different starting value for the random number
generator.
In Section 2.7 we found properties of the sample mean $\bar{y}$ using randomization theory: y1, y2, . . . , yN
were considered to be fixed values, and $\bar{y}$ is unbiased because the average of $\bar{y}_S$ for all possible
samples S equals $\bar{y}_U$. The only probabilities used in finding the expected value and variance of $\bar{y}$
are the probabilities that units are included in the sample.
In your basic statistics class, you learned a different approach to inference. There, you had random
variables {Yi} that followed some probability distribution, and the actual sample values were
realizations of those random variables. Thus you assumed, for example, that Y1, Y2, . . ., Yn were
independent and identically distributed from a normal distribution with mean μ and variance σ² and
used properties of independent random variables and the normal distribution to find expected values
of various statistics.
We can extend this approach to sampling by thinking of random variables Y1, Y2, . . ., YN generated
from some model. The actual values for the finite population, y1, y2, . . ., yN, are one realization of
the random variables. The joint probability distribution of Y1, Y2, . . . , YN supplies the link between
units in the sample and units not in the sample in this model-based approach–a link that is missing
in the randomization approach. Here, we sample {yi, i ∈ S} and use these data to predict the
unobserved values {yi, i ∉ S}. Thus, problems in finite population
sampling may be thought of as prediction problems.
CHAPTER 13
VARIANCE ESTIMATION IN COMPLEX SURVEYS
_________________________________________________________________________
Population means and totals are easily estimated using weights. Estimating variances is more
intricate. We noted before that in a complex survey with several levels of stratification and
clustering, variances for estimated means and totals are calculated at each level and then combined as
the survey design is ascended. Poststratification and nonresponse adjustment also affect the
variance.
In previous chapters, we have presented and derived variance formulas for a variety of sampling
plans. Some of the variance formulas, such as those for simple random samples (SRSs), are
relatively simple. Other formulas, such as $V(\hat{T})$ from a two-stage cluster sample without
replacement, are more complicated. All work for estimating variances of estimated totals. But we
often want to estimate other quantities from survey data for which we have presented no variance
formula. For example, in Chapter 3 we derived an approximate variance for a ratio of two means
when an SRS is taken. What if you want to estimate a ratio, but the survey is not an SRS? How
would you estimate the variance?
This chapter describes several methods for estimating variances of estimated totals and other
statistics from complex surveys. Section 13.1 describes the commonly used linearization method for
calculating variances of nonlinear statistics. Sections 13.2 and 13.3 present random group and
resampling methods for calculating variances of linear and nonlinear statistics. Section 13.4
describes the calculation of generalized variance functions, and Section 13.5 describes constructing
confidence intervals. These methods are described in more detail by Wolter (1985) and Rao (1988);
Rao (1997) and Rust and Rao (1996) summarize recent work.
13.1
Linearization (Taylor Series) Methods
Most of the variance formulas in Chapters 2 through 6 were for estimates of means and totals. Those
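formulas can be used to find variances for any linear combination of estimated means and totals. If
$\hat{T}_1, \ldots, \hat{T}_k$ are unbiased estimates of k totals in the population, then
$$V\left(\sum_{i=1}^{k} a_i \hat{T}_i\right) = \sum_{i=1}^{k} a_i^2 V(\hat{T}_i) + 2\sum_{i=1}^{k-1}\sum_{j=i+1}^{k} a_i a_j \mathrm{Cov}(\hat{T}_i, \hat{T}_j). \tag{13.1}$$
The result can be expressed equivalently using unbiased estimates $\hat{\bar{y}}_1, \ldots, \hat{\bar{y}}_k$ of k means in the population:
$$V\left(\sum_{i=1}^{k} a_i \hat{\bar{y}}_i\right) = \sum_{i=1}^{k} a_i^2 V(\hat{\bar{y}}_i) + 2\sum_{i=1}^{k-1}\sum_{j=i+1}^{k} a_i a_j \mathrm{Cov}(\hat{\bar{y}}_i, \hat{\bar{y}}_j).$$
Thus, if T1 is the total number of dollars robbery victims reported stolen, T2 is the number of days of
work robbery victims missed because of the crime, and T3 is the total medical expenses incurred by
robbery victims, one measure of financial consequences of robbery (assuming $150 per day of work
lost) might be
$$\hat{T}_1 + 150\,\hat{T}_2 + \hat{T}_3.$$
By (13.1), the variance is:
$$V(\hat{T}_1 + 150\hat{T}_2 + \hat{T}_3) = V(\hat{T}_1) + 150^2\, V(\hat{T}_2) + V(\hat{T}_3) + 300\,\mathrm{Cov}(\hat{T}_1, \hat{T}_2) + 2\,\mathrm{Cov}(\hat{T}_1, \hat{T}_3) + 300\,\mathrm{Cov}(\hat{T}_2, \hat{T}_3).$$
This expression requires calculation of six variances and covariances; it is easier computationally to
define a new variable at the observation unit level,
$$q_i = y_{i1} + 150\,y_{i2} + y_{i3},$$
and find $V(\hat{T}_q)$ directly.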
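A small numerical check makes the shortcut concrete. The sketch below assumes an SRS (the text's result holds for general designs, but an SRS keeps the variance estimators simple), with invented data for six robbery victims and an invented population size. It computes the estimated variance of the weighted sum of three estimated totals once from the six variance and covariance terms and once from the single derived variable q; the two agree up to rounding.

```python
from statistics import mean, variance

def sample_cov(u, v):
    """Sample covariance with divisor n - 1."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)

N = 1000                                   # hypothetical population size
stolen = [200, 0, 1500, 50, 300, 80]       # dollars stolen (made-up data)
days = [1, 0, 5, 0, 2, 1]                  # days of work missed
medical = [0, 0, 900, 0, 120, 40]          # medical expenses
cols, a = [stolen, days, medical], [1, 150, 1]
n = len(stolen)
scale = N ** 2 * (1 - n / N) / n           # SRS factor for the variance of a total

# Long way: three variances and three covariances combined with the constants a_i
v_six_terms = sum(a[i] ** 2 * scale * variance(cols[i]) for i in range(3))
v_six_terms += 2 * sum(a[i] * a[j] * scale * sample_cov(cols[i], cols[j])
                       for i in range(3) for j in range(i + 1, 3))

# Short way: one new variable per observation unit, then one variance
q = [sum(a[j] * cols[j][i] for j in range(3)) for i in range(n)]
v_direct = scale * variance(q)

print(v_six_terms, v_direct)
```

The same idea carries over to other designs: compute q for each observation unit and feed it to whatever variance estimator the design calls for.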
Suppose, though, that we are interested in the proportion of total loss accounted for by the stolen
property, T1/Tq. This is not a linear statistic, as T1/Tq cannot be expressed in the form a1T1 + a2Tq for
constants ai. But Taylor’s theorem from calculus allows us to linearize a smooth nonlinear function
h(T1, T2, . . . , Tk) of the population totals; Taylor’s theorem gives the constants a0, a1, . . . , ak so that
$$h(\hat{T}_1, \ldots, \hat{T}_k) \approx a_0 + \sum_{i=1}^{k} a_i \hat{T}_i.$$
Then $V[h(\hat{T}_1, \ldots, \hat{T}_k)]$ may be approximated by $V\left(\sum_{i=1}^{k} a_i \hat{T}_i\right)$, which we know how to calculate
using (13.1).
Taylor series approximations have long been used in statistics to calculate approximate variances.
Woodruff (1971) illustrates their use in complex surveys. Binder (1983) gives a more rigorous
treatment of Taylor series methods for complex surveys and tells how to use linearization when the
parameter of interest θ solves h(θ, T1, . . ., Tk) = 0, but θ is not necessarily expressed as an explicit
function of T1, . . ., Tk.
Example 13.1
The quantity θ = P(1-P), where P is a population proportion, may be estimated by $\hat{\theta} = p(1-p)$.
Assume that p is an unbiased estimator of P and that V(p) is known.
Let h(x) = x(1-x), so θ = h(P) and $\hat{\theta} = h(p)$. Now h is a nonlinear function of x, but the function can
be approximated at any nearby point a by the tangent line to the function; the slope of the tangent
line is given by the derivative, as illustrated in Figure 13.1.
Figure 13.1
The function h(x) = x(1-x), along with the tangent to the function at point P. If p is close to P, then h(p) will be
close to the tangent line. The slope of the tangent line is h’(P) = 1 - 2P.
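The first-order version of Taylor’s theorem states that if the second derivative of h is continuous,
then
$$h(x) = h(a) + h'(a)(x - a) + \int_a^x (x - t)\,h''(t)\,dt;$$
under conditions commonly satisfied in statistics, the last term is small relative to the first two, and
we use the approximation
$$h(p) \approx h(P) + h'(P)(p - P) = P(1-P) + (1 - 2P)(p - P).$$
Then,
$$V[h(p)] \approx (1 - 2P)^2\, V(p - P) = (1 - 2P)^2\, V(p),$$
and V(p) is known, so the approximate variance of h(p) can be calculated.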
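How good is the tangent-line approximation? One way to get a feel for it is to compare the linearization variance (1 − 2P)²V(p) with the exact variance of p(1 − p) in a setting where the exact answer can be enumerated. The sketch below assumes, purely for illustration, that p = X/n with X binomial(n, P), that is, sampling with replacement, so V(p) = P(1 − P)/n; the values n = 400 and P = 0.3 are made up.

```python
from math import comb

def linearized_var(P, var_p):
    """Delta-method variance of h(p) = p(1 - p): h'(P)^2 * V(p)."""
    return (1 - 2 * P) ** 2 * var_p

n, P = 400, 0.3                      # hypothetical sample size and proportion
var_p = P * (1 - P) / n              # V(p) under the binomial assumption
approx = linearized_var(P, var_p)

# Exact variance of h(X/n) by summing over the binomial distribution of X
probs = [comb(n, k) * P ** k * (1 - P) ** (n - k) for k in range(n + 1)]
h_vals = [(k / n) * (1 - k / n) for k in range(n + 1)]
mean_h = sum(pr * h for pr, h in zip(probs, h_vals))
exact = sum(pr * (h - mean_h) ** 2 for pr, h in zip(probs, h_vals))

print(approx, exact, abs(exact - approx) / exact)
```

With a sample this large the two variances differ by well under one percent. Rerunning with a small n, or with P near 0.5 where the slope 1 − 2P vanishes, shows how the approximation degrades.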
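The following are the basic steps for constructing a linearization estimator of the variance of a
nonlinear function of means or totals:
1.
Express the quantity of interest as a function of means or totals of variables measured or
computed in the sample. In general, θ = h(T1, T2, . . . , Tk) or $\theta = h(\bar{y}_{1U}, \ldots, \bar{y}_{kU})$. In Example
13.1, θ = h(P) = P(1-P).
2.
Find the partial derivatives of h with respect to each argument. The partial derivatives,
evaluated at the population quantities, form the linearizing constants ai.
3.
Apply Taylor’s theorem to linearize the estimate:
$$h(\hat{T}_1, \ldots, \hat{T}_k) \approx h(T_1, \ldots, T_k) + \sum_{j=1}^{k} a_j (\hat{T}_j - T_j),$$
where
$$a_j = \left.\frac{\partial h(c_1, \ldots, c_k)}{\partial c_j}\right|_{T_1, \ldots, T_k}.$$
4.
Define the new variable q by
$$q_i = \sum_{j=1}^{k} a_j y_{ij}.$$
Now find the estimated variance of $\hat{T}_q = \sum_{i \in S} w_i q_i$. This will generally approximate the
variance of $h(\hat{T}_1, \hat{T}_2, \ldots, \hat{T}_k)$.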
Example 13.2
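We used linearization methods to approximate the variance of the ratio and regression estimators in
Chapter 3. In Chapter 3, we used an SRS, estimator $\hat{B} = \bar{y}/\bar{x}$, and the approximation
$$\hat{B} - B = \frac{\bar{y} - B\bar{x}}{\bar{x}} \approx \frac{\bar{y} - B\bar{x}}{\bar{x}_U}.$$
The resulting approximation to the variance was
$$V[\hat{B} - B] \approx \frac{1}{\bar{x}_U^2}\, V[\bar{y} - B\bar{x}].$$
Essentially, we used Taylor’s theorem to obtain this approximation. The steps below give the same
result.
1.
Express B as a function of the population totals. Let h(c,d) = d/c, so
$$B = h(T_x, T_y) = \frac{T_y}{T_x} \quad\text{and}\quad \hat{B} = h(\hat{T}_x, \hat{T}_y) = \frac{\hat{T}_y}{\hat{T}_x}.$$
Assume that the sample estimates $\hat{T}_x$ and $\hat{T}_y$ are unbiased.
2.
The partial derivatives are
$$\frac{\partial h(c,d)}{\partial c} = -\frac{d}{c^2} \quad\text{and}\quad \frac{\partial h(c,d)}{\partial d} = \frac{1}{c}.$$
Evaluated at c = Tx and d = Ty, these are
$$-\frac{T_y}{T_x^2} \quad\text{and}\quad \frac{1}{T_x}.$$
3.
By Taylor’s Theorem,
$$\hat{B} = h(\hat{T}_x, \hat{T}_y) \approx h(T_x, T_y) + \left.\frac{\partial h(c,d)}{\partial c}\right|_{T_x, T_y}(\hat{T}_x - T_x) + \left.\frac{\partial h(c,d)}{\partial d}\right|_{T_x, T_y}(\hat{T}_y - T_y).$$
Using the partial derivatives from step 2,
$$\hat{B} - B \approx -\frac{T_y}{T_x^2}(\hat{T}_x - T_x) + \frac{1}{T_x}(\hat{T}_y - T_y).$$
4.
The approximate mean squared error of $\hat{B}$ is
$$E[(\hat{B} - B)^2] \approx E\left[\left\{-\frac{T_y}{T_x^2}(\hat{T}_x - T_x) + \frac{1}{T_x}(\hat{T}_y - T_y)\right\}^2\right] = \frac{1}{T_x^2}\left[B^2 V(\hat{T}_x) + V(\hat{T}_y) - 2B\,\mathrm{Cov}(\hat{T}_x, \hat{T}_y)\right]. \tag{13.2}$$
We can substitute values for B, for the variances and covariance, and possibly for Tx from the
particular sampling scheme used into (13.2). Alternatively, we would define
$$q_i = \frac{1}{T_x}\left[y_i - B x_i\right]$$
and find $\hat{V}(\hat{T}_q)$.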
If the sampling design is an SRS of size n, then
and
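As an illustration of step 4 for an SRS, the sketch below builds the linearized variable q with the unknown B and Tx replaced by their sample estimates (an assumption on my part, since in practice the population values are not available) and applies the usual SRS variance estimator for a total, N²(1 − n/N)s_q²/n. It also checks that this equals the familiar Chapter 3 form (1 − n/N)s_e²/(n x̄²), where e_i = y_i − B̂x_i. The data and the population size are made up, and the function name is hypothetical.

```python
from statistics import mean, variance

def ratio_linearization_variance(x, y, N):
    """Linearization variance estimate for B_hat = y_bar / x_bar under an SRS."""
    n = len(x)
    b_hat = mean(y) / mean(x)                  # equals T_y_hat / T_x_hat
    t_x_hat = N * mean(x)                      # estimated total of x
    # Linearized variable, with estimates plugged in for B and T_x
    q = [(yi - b_hat * xi) / t_x_hat for xi, yi in zip(x, y)]
    v_hat = N ** 2 * (1 - n / N) * variance(q) / n
    return b_hat, v_hat

# Hypothetical SRS of 5 colleges from a population of N = 500
resident = [1200, 1500, 1800, 1000, 2000]
nonresident = [3000, 3900, 4100, 2800, 5200]
N = 500

b_hat, v_hat = ratio_linearization_variance(resident, nonresident, N)

# Same quantity written in the Chapter 3 residual form
n = len(resident)
e = [yi - b_hat * xi for xi, yi in zip(resident, nonresident)]
v_residual_form = (1 - n / N) * variance(e) / (n * mean(resident) ** 2)

print(b_hat, v_hat, v_residual_form)
```

For a design other than an SRS, only the last line of the function changes: q is passed to the variance estimator appropriate to that design.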
Advantages If the partial derivatives are known, linearization almost always gives a variance
estimate for a statistic and can be applied in general sampling designs. Linearization methods have
been used for a long time in statistics, and the theory is well developed. Software exists for
calculating linearization variance estimates for many nonlinear functions of interest, such as ratios
and regression coefficients; some software will be discussed in Section 13.6.
Disadvantages Calculations can be messy, and the method is difficult to apply for complex
functions involving weights. You must either find analytical expressions for the partial derivatives
of h or calculate the partial derivatives numerically. A separate variance formula is needed for each
nonlinear statistic that is estimated, and that can require much special programming; a different
method is needed for each statistic. In addition, not all statistics can be expressed as a smooth
function of the population totals–the median and other quantiles, for example, do not fit into this
framework. The accuracy of the linearization approximation depends on the sample size–the
estimate of the variance is often biased downward if the sample is not large enough.
13.2
Random Group Methods
13.2.1. Replicating the Survey Design
Suppose the basic survey design is replicated independently R times. Independently here means that
after each sample is drawn, the sampled units are replaced in the population so that they are available
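for later samples. Then, the R replicate samples produce R independent estimates of the quantity of
interest; the variability among those estimates can be used to estimate the variance of $\hat{\theta}$.
Mahalanobis (1946) describes early uses of the method, which he calls “replicated networks of
sample units” and “interpenetrating sampling.”
Let
$\theta$ = parameter of interest,
$\hat{\theta}_r$ = estimate of $\theta$ calculated from the rth replicate,
$$\tilde{\theta} = \frac{1}{R} \sum_{r=1}^{R} \hat{\theta}_r.$$
If $\hat{\theta}_r$ is an unbiased estimate of $\theta$, so is $\tilde{\theta}$, and
$$\hat{V}_1(\tilde{\theta}) = \frac{1}{R} \cdot \frac{1}{R-1} \sum_{r=1}^{R} (\hat{\theta}_r - \tilde{\theta})^2 \qquad (13.3)$$
is an unbiased estimate of $V(\tilde{\theta})$. Note that $\hat{V}_1(\tilde{\theta})$ is the sample variance of the R independent
estimates of $\theta$ divided by R, the usual estimate of the variance of a sample mean.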
Example 13.3
The 1991 Information Please Almanac listed enrollment, tuition, and room-and-board costs for every
4-year college in the United States. Suppose we want to estimate the ratio of nonresident tuition to
resident tuition for public colleges and universities in the United States. In a typical implementation
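of the random group method, independent samples would be chosen using the same design and
$\hat{\theta}$ found for each sample. Let's take four SRSs of size 10 each (Table 13.1). The four SRSs are
without replacement, but the same college can appear in more than one of the four SRSs.
For this example, $\hat{\theta}_r = \bar{y}_r / \bar{x}_r$, where $\bar{x}_r$ and $\bar{y}_r$ are the average resident and nonresident
tuition in sample r.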
Table 13.1: Four SRSs of Colleges, Used in Example 13.3

Sample 1
College                                     Enrollment   Resident Tuition   Nonresident Tuition
Columbus College                                         1,365              3,747
Southeastern Massachusetts University                    1,677              4,983
U. S. Naval Academy                                      1,500              1,500
Athens State College                                     1,080              2,160
University of South Alabama                              1,875              2,475
Virginia State University                                3,071              5,135
SUNY College of Technology-Farmingdale                   1,542              3,950
University of Houston                                      930              4,050
CUNY-Lehman College                                      1,340              4,140
Austin Peay State University                             1,210              4,166
Average                                     6934.2       1559               3630.6

Sample 2
College                                     Enrollment   Resident Tuition   Nonresident Tuition
SUNY-New Paltz
Indiana University-Southeast
University of Wisconsin-Platteville
University of California-Santa Barbara
Weber State College
Kennesaw College
South Dakota State University
Dickinson State University
Chadron State College
University of Alaska-Fairbanks
Average                                     6968.6       1505.2             3883.7

Sample 3
College                                     Enrollment   Resident Tuition   Nonresident Tuition
University of Alaska-Anchorage
University of Maine-Fort Kent
Southern University-Baton Rouge
University of Oregon
Virginia State University
Glenville State College
Winston-Salem State University
Framingham State College
SUNY-Old Westbury
Northwest Missouri State University
Average                                     4790.2       1527.5             3756.3

Sample 4
College                                     Enrollment   Resident Tuition   Nonresident Tuition
Central Washington University
Worcester State College
University of California-Davis
Sam Houston State University
University of Texas-Tyler
Southeastern Oklahoma State University
University of Southern Colorado
Pennsylvania State University
East Central University
Univ of Arkansas-Monticello
Average                                     8613         1527.1             4750.8
Thus, $\hat{\theta}_1 = 3630.6/1559 = 2.3288$, $\hat{\theta}_2 = 3883.7/1505.2 = 2.5802$, $\hat{\theta}_3 = 3756.3/1527.5 = 2.4591$, and
$\hat{\theta}_4 = 4750.8/1527.1 = 3.1110$. The sample average of the four independent estimates of $\theta$ is
$\tilde{\theta} = 2.6198$. The sample standard deviation (SD) of the four estimates
is 0.343, so the standard error (SE) of $\tilde{\theta}$ is $0.343/\sqrt{4} = 0.172$. The estimated variance is based on
four independent observations, so a 95% confidence interval (CI) for the ratio is
2.6198 ± 3.18 (0.172) = [2.07, 3.17],
where 3.18 is the appropriate t critical value with 3 degrees of freedom (df). Note that the small
number of replicates causes the confidence interval to be wider than it would be if more replicate
samples were taken, because the estimate of the variance with 3 df is not very stable.
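The arithmetic of Example 13.3 can be checked with a short Python sketch. It uses only the four pairs of average tuition figures reported for the samples in Table 13.1 and the t critical value 3.18 quoted in the text; the function and variable names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

def random_group_estimate(estimates):
    """Return (theta_tilde, V1) as in (13.3) for R independent replicate estimates."""
    R = len(estimates)
    theta_tilde = mean(estimates)
    v1 = stdev(estimates) ** 2 / R      # sample variance of the estimates, divided by R
    return theta_tilde, v1

# Sample averages reported for the four SRSs in Table 13.1.
avg_resident = [1559.0, 1505.2, 1527.5, 1527.1]
avg_nonresident = [3630.6, 3883.7, 3756.3, 4750.8]

ratios = [y / x for x, y in zip(avg_resident, avg_nonresident)]
theta_tilde, v1 = random_group_estimate(ratios)
se = sqrt(v1)

t_crit = 3.18                           # t critical value with 3 df, taken from the text
ci = (theta_tilde - t_crit * se, theta_tilde + t_crit * se)
print(round(theta_tilde, 4), round(se, 3), ci)
```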
13.2.2. Dividing the Sample into Random Groups
In practice, subsamples are not usually drawn independently, but the complete sample is selected
according to the survey design. The complete sample is then divided into R groups so that each
group forms a miniature version of the survey, mirroring the sample design. The groups are then
treated as though they are independent replicates of the basic survey design.
If the sample is an SRS of size n, the groups are formed by randomly apportioning the n observations
into R groups, each of size n/R. These pseudo-random groups are not quite independent replicates
because an observation unit can only appear in one of the groups; if the population size is large
relative to the sample size, however, the groups can be treated as though they are independent
replicates. In a cluster sample, the PSUs are randomly divided among the R groups. The PSU takes
all its observation units with it to the random group, so each random group is still a cluster sample.
In a stratified multistage sample, a random group contains a sample of PSUs from each stratum.
Note that if k PSUs are sampled in the smallest stratum, at most k random groups can be formed.
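If $\theta$ is a nonlinear quantity, $\tilde{\theta}$ will not, in general, be the same as $\hat{\theta}$, the estimator calculated directly
from the complete sample. For example, in ratio estimation,
$$\tilde{\theta} = \frac{1}{R} \sum_{r=1}^{R} \frac{\bar{y}_r}{\bar{x}_r} \quad \text{while} \quad \hat{\theta} = \frac{\bar{y}}{\bar{x}}.$$
Usually, $\hat{\theta}$ is a more natural estimator than $\tilde{\theta}$. Sometimes $\hat{V}_1(\tilde{\theta})$ from (13.3) is used to
estimate $V(\hat{\theta})$, although it is an overestimate. Another estimator of the variance is slightly larger
but is often used:
$$\hat{V}_2(\hat{\theta}) = \frac{1}{R} \cdot \frac{1}{R-1} \sum_{r=1}^{R} (\hat{\theta}_r - \hat{\theta})^2. \qquad (13.4)$$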
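The following Python sketch shows one way to divide an SRS into random groups and compute both variance estimators for a ratio. The data, the seed, and the function names are hypothetical; the sketch assumes the sample size is a multiple of R so that all groups have the same size.

```python
import random
from statistics import mean

def split_into_random_groups(units, R, seed=0):
    """Randomly apportion the sampled units into R groups of equal size."""
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    return [shuffled[r::R] for r in range(R)]

def ratio(pairs):
    """Ratio estimator: sum of y divided by sum of x over (x, y) pairs."""
    return sum(y for _, y in pairs) / sum(x for x, _ in pairs)

def random_group_variances(groups, full_sample, estimator):
    R = len(groups)
    theta_r = [estimator(g) for g in groups]
    theta_tilde = mean(theta_r)            # average of the group estimates
    theta_hat = estimator(full_sample)     # estimate from the complete sample
    v1 = sum((t - theta_tilde) ** 2 for t in theta_r) / (R * (R - 1))   # (13.3)
    v2 = sum((t - theta_hat) ** 2 for t in theta_r) / (R * (R - 1))     # (13.4)
    return theta_hat, theta_tilde, v1, v2

# Hypothetical SRS of 12 (x, y) pairs, split into R = 3 random groups.
units = [(10, 12), (20, 25), (30, 33), (40, 50), (15, 14), (25, 31),
         (35, 40), (45, 52), (12, 16), (22, 24), (32, 41), (42, 47)]
groups = split_into_random_groups(units, R=3)
theta_hat, theta_tilde, v1, v2 = random_group_variances(groups, units, ratio)
print(theta_hat, theta_tilde, v1, v2)
```

Because the sum of squared deviations is smallest when taken about the mean of the group estimates, the second estimator can never be smaller than the first.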
Example 13.4
The 1987 Survey of Youth in Custody, discussed in Example 7.4, was divided into seven random
groups. The survey design had 16 strata. Strata 6-16 each consisted of one facility (= PSU), and
these facilities were sampled with probability 1. In strata 1-5, facilities were selected with
probability proportional to number of residents in the 1985 Children in Custody census.
It was desired that each random group be a miniature of the sampling design. For each self-representing facility in strata 6-16, random group numbers were assigned as follows: the first
resident selected from the facility was assigned a number between 1 and 7. Let’s say the first
resident was assigned number 6. Then the second resident in that facility would be assigned number
7, the third resident 1, the fourth resident 2, and so on. In strata 1-5, all residents in a facility (PSU)
were assigned to the same random group. Thus, for the seven facilities sampled in stratum 2, all
residents in facility 33 were assigned random group number 1, all residents in facility 9 were
assigned random group 2 (etc.). Seven random groups were formed because strata 2-5 each have
seven PSUs.
After all random group assignments were made, each random group had the same basic design as the
original sample. Random group 1, for example, forms a stratified sample in which a (roughly)
random sample of residents is taken from the self-representing facilities in strata 6-16, and a pps
(probability proportional to size) sample of facilities is taken from each of strata 1-5.
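To use the random group method to estimate a variance, $\hat{\theta}$ is calculated for each random group. The
following table shows estimates of mean age of residents for each random group; each estimate was
calculated using
$$\hat{\theta}_r = \frac{\sum w_i y_i}{\sum w_i},$$
where wi is the final weight for resident i, yi is the age of resident i, and the summations are over
observations in random group r.

Random Group Number   Estimate of Mean Age, $\hat{\theta}_r$
1                     16.55
2                     16.66
3                     16.83
4                     16.06
5                     16.32
6                     17.03
7                     17.27

The seven estimates, $\hat{\theta}_r$, are treated as independent observations, so
$$\tilde{\theta} = \frac{1}{7} \sum_{r=1}^{7} \hat{\theta}_r = 16.67$$
and
$$\hat{V}_1(\tilde{\theta}) = \frac{1}{7} \cdot \frac{1}{6} \sum_{r=1}^{7} (\hat{\theta}_r - \tilde{\theta})^2 = \frac{0.1707}{7} = 0.024.$$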
Using the entire data set, we calculate
We can use either
with
or to calculate confidence intervals; using
a 95% CI for mean age is
(2.45 is the t critical value with 6 df).
Advantages No special software is necessary to estimate the variance, and it is very easy to
calculate the variance estimate. The method is well suited to multiparameter or nonparametric
problems. It can be used to estimate variances for percentiles and nonsmooth functions, as well as
variances of smooth functions of the population totals. Random group methods are easily used after
weighting adjustments for nonresponse and undercoverage.
Disadvantages The number of random groups is often small–this gives imprecise estimates of the
variances. Generally, you would like at least ten random groups to obtain a more stable estimate of
the variance and to avoid inflating the confidence interval by using the t distribution rather than the
normal distribution. Setting up the random groups can be difficult in complex designs, as each
random group must have the same design structure as the complete survey. The survey design may
limit the number of random groups that can be constructed; if two PSUs are selected in each stratum,
then only two random groups can be formed.
13.3
Resampling and Replication Methods
Random group methods are easy to compute and explain but are unstable if a complex sample can
only be split into a small number of groups. Resampling methods treat the sample as if it were itself
a population; we take different samples from this new “population” and use the subsamples to
estimate a variance. All methods in this section calculate variance estimates for a sample in which
PSUs are sampled with replacement. If PSUs are sampled without replacement, these methods may
still be used but are expected to overestimate the variance and result in conservative confidence
intervals.
13.3.1. Balanced Repeated Replication (BRR)
Some surveys are stratified to the point that only two PSUs are selected from each stratum. This
gives the highest degree of stratification possible while still allowing calculation of variance
estimates in each stratum.
13.3.1.1.
BRR In a Stratified Random Sample
We illustrate BRR for a problem we already know how to solve: calculating the variance for
$\bar{y}_{str}$ from a stratified random sample. More complicated statistics from stratified multistage
samples are discussed in Section 13.3.1.2.
Suppose an SRS of two observation units is chosen from each of seven strata. We arbitrarily label
one of the sampled units in stratum h as yh1 and the other as yh2. The sampled values are given in
Table 13.2.
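Table 13.2: A Small Stratified Random Sample, Used to Illustrate BRR

Stratum   N_h/N   y_h1     y_h2     \bar{y}_h   y_h1 - y_h2
1         .30     2,000    1,792    1,896       208
2         .10     4,525    4,735    4,630       -210
3         .05     9,550    14,060   11,805      -4,510
4         .10     800      1,250    1,025       -450
5         .20     9,300    7,264    8,282       2,036
6         .05     13,286   12,840   13,063      446
7         .20     2,106    2,070    2,088       36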
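The stratified estimate of the population mean is
$$\bar{y}_{str} = \sum_{h=1}^{H} \frac{N_h}{N} \bar{y}_h = 4451.7.$$
Ignoring the fpc's (finite population corrections) in Equation (4.5) gives the variance estimate
$$\hat{V}_{str}(\bar{y}_{str}) = \sum_{h=1}^{H} \left(\frac{N_h}{N}\right)^2 \frac{s_h^2}{n_h};$$
when nh = 2, as here, $s_h^2 = (y_{h1} - y_{h2})^2 / 2$, so
$$\hat{V}_{str}(\bar{y}_{str}) = \sum_{h=1}^{H} \left(\frac{N_h}{N}\right)^2 \frac{(y_{h1} - y_{h2})^2}{4}.$$
Here, $\hat{V}_{str}(\bar{y}_{str}) = 55,892.75$. This may overestimate the variance if sampling is without
replacement.
To use the random group method, we would randomly select one of the observations in each stratum
for group 1 and assign the other to group 2. The groups in this situation are half-samples. For
example, group 1 might consist of {y11, y22, y32, y42, y51, y62, y71} and group 2 of the other seven
observations. Then,
$$\hat{\theta}_1 = (.3)(2000) + (.1)(4735) + \cdots + (.2)(2106) = 4824.7$$
and
$$\hat{\theta}_2 = (.3)(1792) + (.1)(4525) + \cdots + (.2)(2070) = 4078.7.$$
The random group estimate of the variance, in this case $(4824.7 - 4078.7)^2/4 = 139,129$, has only 1 df for a
two-PSU-per-stratum design and is unstable in practice. If a different assignment of observations to groups had
been made (had, for example, group 1 consisted of yh1 for strata 2, 3, and 5 and yh2 for strata 1, 4, 6
and 7), then $\hat{\theta}_1 = 4508.6$ and $\hat{\theta}_2 = 4394.8$,
and the random group estimate of the variance would have been 3238.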
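McCarthy (1966; 1969) notes that altogether $2^H$ possible half-samples could be formed and suggests
using a balanced sample of the $2^H$ possible half-samples to estimate the variance. Balanced
repeated replication uses the variability among R replicate half-samples that are selected in a
balanced way to estimate the variance of $\hat{\theta}$.
To define balance, let's introduce the following notation. Half-sample r can be defined by a vector
$\alpha_r = (\alpha_{r1}, \ldots, \alpha_{rH})$: Let
$$y_h(\alpha_r) = \begin{cases} y_{h1} & \text{if } \alpha_{rh} = 1, \\ y_{h2} & \text{if } \alpha_{rh} = -1. \end{cases}$$
Equivalently,
$$y_h(\alpha_r) = \frac{\alpha_{rh} + 1}{2} y_{h1} - \frac{\alpha_{rh} - 1}{2} y_{h2}.$$
If group 1 contains observations {y11, y22, y32, y42, y51, y62, y71} as above, then $\alpha_1$ = (1, -1, -1, -1, 1, -1,
1). Similarly $\alpha_2$ = (-1, 1, 1, 1, -1, 1, -1). The set of R replicate half-samples is balanced if
$$\sum_{r=1}^{R} \alpha_{rh} \alpha_{rl} = 0 \quad \text{for all } l \neq h.$$
Let $\hat{\theta}(\alpha_r)$ be the estimate of interest, calculated the same way as $\hat{\theta}$ but using only the observations
in the half-sample selected by $\alpha_r$. For estimating the mean of a stratified sample,
$$\hat{\theta}(\alpha_r) = \bar{y}_{str}(\alpha_r) = \sum_{h=1}^{H} \frac{N_h}{N} y_h(\alpha_r).$$
Define the BRR variance estimator to be
$$\hat{V}_{BRR}(\hat{\theta}) = \frac{1}{R} \sum_{r=1}^{R} [\hat{\theta}(\alpha_r) - \hat{\theta}]^2.$$
If the set of half-samples is balanced, then $\hat{V}_{BRR}(\bar{y}_{str}) = \hat{V}_{str}(\bar{y}_{str})$. (The proof of this is left as
Exercise 6.) If, in addition, $\sum_{r=1}^{R} \alpha_{rh} = 0$ for h = 1, . . . , H, then
$$\frac{1}{R} \sum_{r=1}^{R} \bar{y}_{str}(\alpha_r) = \bar{y}_{str}.$$
For our example, the set of $\alpha$'s in the following table meets the balancing condition
$\sum_{r} \alpha_{rh} \alpha_{rl} = 0$ for all $l \neq h$. The 8 x 7 matrix of -1's and 1's has orthogonal columns; in fact, it is the design matrix
(excluding the column of 1's) for a fractional factorial design (Box et al. 1978). Designs described
by Plackett and Burman (1946) give matrices with k orthogonal columns, for k a multiple of 4;
Wolter (1985) explicitly lists some of these matrices.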
                    Stratum (h)
Half-Sample (r)     1     2     3     4     5     6     7
$\alpha_1$         -1    -1    -1     1     1     1    -1
$\alpha_2$          1    -1    -1    -1    -1     1     1
$\alpha_3$         -1     1    -1    -1     1    -1     1
$\alpha_4$          1     1    -1     1    -1    -1    -1
$\alpha_5$         -1    -1     1     1    -1    -1     1
$\alpha_6$          1    -1     1    -1     1    -1    -1
$\alpha_7$         -1     1     1    -1    -1     1    -1
$\alpha_8$          1     1     1     1     1     1     1

The estimate from each half-sample, $\hat{\theta}(\alpha_r) = \bar{y}_{str}(\alpha_r)$, is calculated from the data in Table 13.2.

Half-Sample (r)   $\hat{\theta}(\alpha_r)$   $[\hat{\theta}(\alpha_r) - \hat{\theta}]^2$
1                 4732.4                     78,792.5
2                 4439.8                     141.6
3                 4741.3                     83,868.2
4                 4344.3                     11,534.8
5                 4084.6                     134,762.4
6                 4592.0                     19,684.1
7                 4123.7                     107,584.0
8                 4555.5                     10,774.4
Average           4451.7                     55,892.8

The average of $[\hat{\theta}(\alpha_r) - \hat{\theta}]^2$ for the eight replicate half-samples is 55,892.75, which is the same as
$\hat{V}_{str}(\bar{y}_{str})$ for sampling with replacement. Note that we can do the BRR estimation above by
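creating a new variable of weights for each replicate half-sample. The sampling weight for
observation i in stratum h is whi = Nh / nh, and
$$\bar{y}_{str} = \frac{\sum_{h=1}^{H} \sum_{i=1}^{2} w_{hi} y_{hi}}{\sum_{h=1}^{H} \sum_{i=1}^{2} w_{hi}}.$$
In BRR with a stratified random sample, we eliminate one of the two observations in stratum h to
calculate $y_h(\alpha_r)$. To compensate, we double the weight for the remaining observation. Define
$$w_{hi}(\alpha_r) = \begin{cases} 2 w_{hi} & \text{if observation } i \text{ of stratum } h \text{ is in the half-sample selected by } \alpha_r, \\ 0 & \text{otherwise.} \end{cases}$$
Then,
$$\bar{y}_{str}(\alpha_r) = \frac{\sum_{h=1}^{H} \sum_{i=1}^{2} w_{hi}(\alpha_r) y_{hi}}{\sum_{h=1}^{H} \sum_{i=1}^{2} w_{hi}(\alpha_r)}.$$
Similarly, for any statistic $\hat{\theta}$ calculated using the weights whi, $\hat{\theta}(\alpha_r)$ is calculated exactly the same
way, but using the new weights $w_{hi}(\alpha_r)$. Using the new weight variables instead of selecting the
subset of observations simplifies calculations for surveys with many response variables: the same
column $w(\alpha_r)$ can be used to find the rth half-sample estimate for all quantities of interest. The
modified weights also make it easy to extend the method to stratified multistage samples.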
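The BRR calculation for the small stratified sample can be reproduced with the Python sketch below. The stratum weights, differences, and half-sample matrix are those of the example; the individual y values are the ones consistent with the differences and half-sample estimates reported in the text, and should be treated as an assumption if the table is not legible in your copy. Function names are illustrative.

```python
# Relative stratum sizes N_h/N and the two sampled values per stratum (Table 13.2).
W = [0.30, 0.10, 0.05, 0.10, 0.20, 0.05, 0.20]
y1 = [2000, 4525, 9550, 800, 9300, 13286, 2106]
y2 = [1792, 4735, 14060, 1250, 7264, 12840, 2070]

# Balanced set of eight half-samples; entry h of row r is alpha_rh.
alphas = [
    (-1, -1, -1,  1,  1,  1, -1),
    ( 1, -1, -1, -1, -1,  1,  1),
    (-1,  1, -1, -1,  1, -1,  1),
    ( 1,  1, -1,  1, -1, -1, -1),
    (-1, -1,  1,  1, -1, -1,  1),
    ( 1, -1,  1, -1,  1, -1, -1),
    (-1,  1,  1, -1, -1,  1, -1),
    ( 1,  1,  1,  1,  1,  1,  1),
]

def is_balanced(alphas):
    """Check that sum_r alpha_rh * alpha_rl = 0 for every pair of strata h != l."""
    H = len(alphas[0])
    return all(sum(a[h] * a[l] for a in alphas) == 0
               for h in range(H) for l in range(h + 1, H))

def half_sample_mean(alpha):
    """Stratified mean using y_h1 where alpha_rh = 1 and y_h2 where alpha_rh = -1."""
    return sum(w * (a1 if s == 1 else a2)
               for w, a1, a2, s in zip(W, y1, y2, alpha))

y_str = sum(w * (a1 + a2) / 2 for w, a1, a2 in zip(W, y1, y2))
estimates = [half_sample_mean(a) for a in alphas]
v_brr = sum((e - y_str) ** 2 for e in estimates) / len(alphas)

print(is_balanced(alphas), round(y_str, 1), round(v_brr, 2))
```

The printed variance should agree with the stratified-sampling variance estimate quoted in the text, which is the point of the balancing condition.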
13.3.1.2.
BRR in a Stratified Multistage Survey
When $\bar{y}_{str}$ is the only quantity of interest in a stratified random sample, BRR is simply a fancy method
of calculating the variance of Equation (4.5) and adds little extra to the procedure in Chapter 4.
BRR's value in a complex survey comes from its ability to estimate the variance of a general
population quantity $\theta$, where $\theta$ may be a ratio of two variables, a correlation coefficient, a quantile,
or another quantity of interest.
Suppose the population has H strata, and two PSUs are selected from stratum h with unequal
probabilities and with replacement. (In replication methods, we like sampling with replacement
because the subsampling design does not affect the variance estimator, as we saw in Section 6.3).
The same method may be used when sampling is done without replacement in each stratum, but the
estimated variance of $\hat{\theta}$, calculated under the assumption of with-replacement sampling, is expected
to be larger than the without-replacement variance.
The data file for a complex survey with two PSUs per stratum often resembles that shown in Table
13.3, after sorting by stratum and PSU.
The vector $\alpha_r$ defines the half-sample r: If $\alpha_{rh} = 1$, then all observation units in PSU 1 of stratum h
are in half-sample r; if $\alpha_{rh} = -1$, then all observation units in PSU 2 of stratum h are in half-sample r.
The vectors $\alpha_r$ are selected in a balanced way, exactly as in stratified random sampling. Now, for
half-sample r, create a new column of weights $w(\alpha_r)$:
$$w_i(\alpha_r) = \begin{cases} 2 w_i & \text{if observation unit } i \text{ is in half-sample } r, \\ 0 & \text{otherwise.} \end{cases}$$
Table 13.3: Data Structure After Sorting

Observation   Stratum   PSU      SSU      Weight,   Response     Response     Response
Number        Number    Number   Number   w_i       Variable 1   Variable 2   Variable 3
1             1         1        1        w_1       y_1          x_1          u_1
2             1         1        2        w_2       y_2          x_2          u_2
3             1         1        3        w_3       y_3          x_3          u_3
4             1         1        4        w_4       y_4          x_4          u_4
5             1         2        1        w_5       y_5          x_5          u_5
6             1         2        2        w_6       y_6          x_6          u_6
7             1         2        3        w_7       y_7          x_7          u_7
8             1         2        4        w_8       y_8          x_8          u_8
9             1         2        5        w_9       y_9          x_9          u_9
10            2         1        1        w_10      y_10         x_10         u_10
11            2         1        2        w_11      y_11         x_11         u_11
etc.

For the data structure in Table 13.3, and $\alpha_{r1} = -1$ and $\alpha_{r2} = 1$, the column $w(\alpha_r)$ will be
(0, 0, 0, 0, 2w5, 2w6, 2w7, 2w8, 2w9, 2w10, 2w11, ....).
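Now use the column $w(\alpha_r)$ instead of w to estimate quantities for half-sample r. The estimate of the
population total of y for the full sample is $\sum w_i y_i$; the estimate of the population total of y for
half-sample r is $\sum w_i(\alpha_r) y_i$. If $\theta = T_y / T_x$, then
$$\hat{\theta} = \frac{\sum w_i y_i}{\sum w_i x_i}$$
and
$$\hat{\theta}(\alpha_r) = \frac{\sum w_i(\alpha_r) y_i}{\sum w_i(\alpha_r) x_i}.$$
We saw in Section 7.3 that the empirical distribution function is calculated using the weights:
$$\hat{F}(y) = \frac{\text{sum of } w_i \text{ for all observations with } y_i \leq y}{\text{sum of } w_i \text{ for all observations}}.$$
Then, the empirical distribution using half-sample r is
$$\hat{F}_r(y) = \frac{\text{sum of } w_i(\alpha_r) \text{ for all observations with } y_i \leq y}{\text{sum of } w_i(\alpha_r) \text{ for all observations}}.$$
If $\theta$ is the population median, then $\hat{\theta}$ may be defined as the smallest value of y for which
$\hat{F}(y) \geq 1/2$, and $\hat{\theta}(\alpha_r)$ is the smallest value of y for which $\hat{F}_r(y) \geq 1/2$.
For any quantity $\theta$, we define
$$\hat{V}_{BRR}(\hat{\theta}) = \frac{1}{R} \sum_{r=1}^{R} [\hat{\theta}(\alpha_r) - \hat{\theta}]^2. \qquad (13.6)$$
BRR can also be used to estimate covariances of statistics: If $\theta$ and $\eta$ are two quantities of interest,
then
$$\widehat{\mathrm{Cov}}_{BRR}(\hat{\theta}, \hat{\eta}) = \frac{1}{R} \sum_{r=1}^{R} [\hat{\theta}(\alpha_r) - \hat{\theta}][\hat{\eta}(\alpha_r) - \hat{\eta}].$$
Other BRR variance estimators, variations of (13.6), are described in Exercise 7.
While the exact equivalence of $\hat{V}_{BRR}(\bar{y}_{str})$ and $\hat{V}_{str}(\bar{y}_{str})$ does not extend to nonlinear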
statistics, Krewski and Rao (1981) and Rao and Wu (1985) show that if h is a smooth function of the
population totals, the variance estimate from BRR is asymptotically equivalent to that from
linearization. BRR also provides a consistent estimator of the variance for quantiles when a
stratified random sample is taken (Shao and Wu 1992).
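The replicate-weight form of BRR can be sketched in Python as follows. The data set (three strata, two PSUs per stratum, two observation units per PSU), the field layout, and all function names are hypothetical; the half-sample matrix is a 4 x 3 set with orthogonal columns chosen for illustration.

```python
# Hypothetical records: (stratum index, PSU number, weight w_i, y_i, x_i).
records = [
    (0, 1, 10.0, 12.0, 5.0), (0, 1, 10.0, 15.0, 6.0),
    (0, 2, 12.0, 11.0, 4.0), (0, 2, 12.0, 18.0, 7.0),
    (1, 1,  8.0, 20.0, 9.0), (1, 1,  8.0, 22.0, 8.0),
    (1, 2,  9.0, 19.0, 7.0), (1, 2,  9.0, 25.0, 10.0),
    (2, 1, 15.0,  9.0, 3.0), (2, 1, 15.0, 14.0, 6.0),
    (2, 2, 14.0, 10.0, 5.0), (2, 2, 14.0, 13.0, 4.0),
]

# Balanced half-samples for H = 3 strata (columns are mutually orthogonal).
alphas = [(1, 1, 1), (1, -1, -1), (-1, 1, -1), (-1, -1, 1)]

def replicate_weights(records, alpha):
    """Double the weight of units in the selected PSU of each stratum; zero the rest."""
    out = []
    for stratum, psu, w, _, _ in records:
        selected_psu = 1 if alpha[stratum] == 1 else 2
        out.append(2.0 * w if psu == selected_psu else 0.0)
    return out

def weighted_ratio(records, weights):
    num = sum(w * rec[3] for rec, w in zip(records, weights))
    den = sum(w * rec[4] for rec, w in zip(records, weights))
    return num / den

def weighted_median(records, weights):
    """Smallest y whose weighted empirical distribution function is at least 1/2."""
    total = sum(weights)
    cum = 0.0
    for y, w in sorted((rec[3], w) for rec, w in zip(records, weights)):
        cum += w
        if cum >= total / 2:
            return y

def brr_variance(records, alphas, stat):
    full_weights = [rec[2] for rec in records]
    theta = stat(records, full_weights)
    reps = [stat(records, replicate_weights(records, a)) for a in alphas]
    return theta, sum((t - theta) ** 2 for t in reps) / len(alphas)   # (13.6)

ratio_hat, v_ratio = brr_variance(records, alphas, weighted_ratio)
median_hat, v_median = brr_variance(records, alphas, weighted_median)
print(ratio_hat, v_ratio, median_hat, v_median)
```

The same replicate weight columns serve both statistics, which is the practical advantage of working with weights rather than with subsets of observations.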
Example 13.5
Bye and Gallicchio (1993) describe BRR estimates of variance in the U. S. Survey of Income and
Program Participation (SIPP). SIPP, like the National Crime Victimization Survey (NCVS), has a
stratified multistage cluster design. Self-representing (SR) strata consist of one PSU that is sampled
with probability 1, and one PSU is selected with PPS from each non-self-representing (NSR)
stratum. Strictly speaking, BRR does not apply since only one PSU is selected in each stratum, and
BRR requires two PSUs per stratum. To use BRR, “pseudostrata” and “pseudo-PSUs” were formed.
A typical pseudostratum was formed by combining an SR stratum with two similar NSR strata: the
PSU selected in each NSR stratum was randomly assigned to one of the two pseudo-PSUs, and the
segments in the SR PSU were randomly split between the two pseudo-PSUs. This procedure created
72 pseudostrata, each with two pseudo-PSUs.
The 72 half-samples, each containing the observations from one pseudo-PSU from each
pseudostratum, were formed using a 71-factor Plackett-Burman (1946) design. This design is
orthogonal, so the set of replicate half-samples is balanced.
About 8500 of the 54,000 persons in the 1990 sample said they received Social Security benefits;
Bye and Gallicchio wanted to estimate the mean and median monthly benefit amount for persons
receiving benefits, for a variety of subpopulations. The mean monthly benefit for married males was
estimated as
$$\bar{y} = \frac{\sum_{i \in S_M} w_i y_i}{\sum_{i \in S_M} w_i},$$
where yi is the monthly benefit amount for person i in the sample, wi is the weight assigned to person
i, and SM is the subset of the sample consisting of married males receiving Social Security benefits.
The median benefit payment can be estimated from the empirical distribution function for the
married men in the sample:
$$\hat{F}(y) = \frac{\sum_{i \in S_M,\; y_i \leq y} w_i}{\sum_{i \in S_M} w_i}.$$
The estimated median, $\hat{\theta}$, satisfies $\hat{F}(\hat{\theta}) \geq 1/2$ but $\hat{F}(x) < 1/2$ for all $x < \hat{\theta}$.
Calculating $\hat{\theta}(\alpha_r)$ for a replicate is simple: merely define a new weight variable $w(\alpha_r)$, as previously
described, and use $w(\alpha_r)$ instead of w to estimate the mean and median.
Advantages BRR gives a variance estimate that is asymptotically equivalent to that from
linearization methods for smooth functions of population totals and for quantiles. It requires
relatively few computations when compared with the jackknife and the bootstrap.
Disadvantages As defined earlier, BRR requires a two-PSU-per-stratum design. In practice,
though, it is often extended to other sampling designs by using more complicated balancing schemes.
BRR, like the jackknife and bootstrap, estimates the with-replacement variance and may
overestimate the variance if the Nh’s, the number of PSUs in stratum h in the population, are small.
13.3.2. The Jackknife
The jackknife method, like BRR, extends the random group method by allowing the replicate
groups to overlap. The jackknife was introduced by Quenouille (1949; 1956) as a method of
reducing bias; Tukey (1958) used it to estimate variances and calculate confidence intervals. In this
section, we describe the delete-1 jackknife; Shao and Tu (1995) discuss other forms of the jackknife
and give theoretical results.
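For an SRS, let $\hat{\theta}_{(j)}$ be the estimator of the same form as $\hat{\theta}$ but not using observation j. Thus, if
$\hat{\theta} = \bar{y}$, then
$$\hat{\theta}_{(j)} = \bar{y}_{(j)} = \frac{1}{n-1} \sum_{i \neq j} y_i.$$
For an SRS, define the delete-1 jackknife estimator
(so called because we delete one observation in each replicate) as
$$\hat{V}_{JK}(\hat{\theta}) = \frac{n-1}{n} \sum_{j=1}^{n} (\hat{\theta}_{(j)} - \hat{\theta})^2. \qquad (13.7)$$
Why the multiplier (n - 1) / n? Let's look at $\hat{V}_{JK}(\hat{\theta})$ when $\hat{\theta} = \bar{y}$. When $\hat{\theta} = \bar{y}$,
$$\bar{y}_{(j)} = \frac{n\bar{y} - y_j}{n-1} = \bar{y} - \frac{1}{n-1}(y_j - \bar{y}).$$
Then,
$$\sum_{j=1}^{n} (\bar{y}_{(j)} - \bar{y})^2 = \frac{1}{(n-1)^2} \sum_{j=1}^{n} (y_j - \bar{y})^2 = \frac{s_y^2}{n-1}.$$
Thus, $\hat{V}_{JK}(\bar{y}) = s_y^2 / n$,
the with-replacement estimate of the variance of $\bar{y}$.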
Example 13.6
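Let's use the jackknife to estimate the ratio of nonresident tuition to resident tuition for the first
group of colleges in Table 13.1. Here, $\hat{\theta} = \hat{B} = \bar{y}/\bar{x}$
and $\hat{\theta}_{(j)} = \hat{B}_{(j)} = \bar{y}_{(j)}/\bar{x}_{(j)}$.
For each jackknife group, omit one observation. Thus, $\bar{x}_{(1)}$
is the average of all x's except for $x_1$ (Table 13.4).
Here, $\hat{B} = 2.3288$, $\sum_j (\hat{B}_{(j)} - \hat{B})^2 = 0.1043$,
and $\hat{V}_{JK}(\hat{B}) = (9/10)(0.1043) = 0.0938$.

Table 13.4: Jackknife Calculations for Example 13.6

j     x       y       \bar{x}_(j)   \bar{y}_(j)   \hat{B}_(j)
1     1365    3747    1580.6        3617.7        2.2889
2     1677    4983    1545.9        3480.3        2.2513
3     1500    1500    1565.6        3867.3        2.4703
4     1080    2160    1612.2        3794.0        2.3533
5     1875    2475    1523.9        3759.0        2.4667
6     3071    5135    1391.0        3463.4        2.4899
7     1542    3950    1560.9        3595.1        2.3032
8     930     4050    1628.9        3584.0        2.2003
9     1340    4140    1583.3        3574.0        2.2573
10    1210    4166    1597.8        3571.1        2.2350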
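The delete-1 jackknife for the ratio in Example 13.6 can be sketched in Python. The y values are the nonresident tuition figures of the first sample; the x values are not all legible in the extracted tables and are recovered here from the leave-one-out means of x in Table 13.4 together with the sample average of 1559, so treat them as an assumption. Function names are illustrative.

```python
def ratio(pairs):
    """B_hat = (sum of y) / (sum of x) over (x, y) pairs."""
    return sum(y for _, y in pairs) / sum(x for x, _ in pairs)

def jackknife_variance(data, estimator):
    """Delete-1 jackknife for an SRS, as in (13.7)."""
    n = len(data)
    theta = estimator(data)
    leave_one_out = [estimator(data[:j] + data[j + 1:]) for j in range(n)]
    v_jk = (n - 1) / n * sum((t - theta) ** 2 for t in leave_one_out)
    return theta, leave_one_out, v_jk

# Resident tuition (x, recovered as described above) and nonresident tuition (y).
x = [1365, 1677, 1500, 1080, 1875, 3071, 1542, 930, 1340, 1210]
y = [3747, 4983, 1500, 2160, 2475, 5135, 3950, 4050, 4140, 4166]

b_hat, b_loo, v_jk = jackknife_variance(list(zip(x, y)), ratio)
print(round(b_hat, 4), [round(b, 4) for b in b_loo], round(v_jk, 4))
```

Passing a different estimator function in place of `ratio` gives the jackknife variance of another statistic with no other change, which is the sense in which the jackknife is an all-purpose method.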
How can we extend this to a cluster sample? One might think that you could just delete one
observation unit at a time, but that will not work–deleting one observation unit at a time destroys the
cluster structure and gives an estimate of the variance that is only correct if the intraclass correlation
is zero. In any resampling method and in the random group method, keep observation units within a
PSU together while constructing the replicates–this preserves the dependence among observation
units within the same PSU. For a cluster sample, then, we would apply the jackknife variance
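estimator in (13.7) by letting n be the number of PSUs and letting $\hat{\theta}_{(j)}$ be the estimate of $\theta$ that we
would obtain by deleting all the observations in PSU j.
In a stratified multistage cluster sample, the jackknife is applied separately in each stratum at the first
stage of sampling, with one PSU deleted at a time. Suppose there are H strata, and nh PSUs are
chosen for the sample from stratum h. Assume these PSUs are chosen with replacement.
To apply the jackknife, delete one PSU at a time. Let $\hat{\theta}_{(hj)}$ be the estimator of the same form as $\hat{\theta}$
when PSU j of stratum h is omitted. To calculate $\hat{\theta}_{(hj)}$, define a new weight variable: Let
$$w_{i(hj)} = \begin{cases} w_i & \text{if observation unit } i \text{ is not in stratum } h, \\ 0 & \text{if observation unit } i \text{ is in PSU } j \text{ of stratum } h, \\ \dfrac{n_h}{n_h - 1} w_i & \text{if observation unit } i \text{ is in stratum } h \text{ but not in PSU } j. \end{cases}$$
Then use the weights wi(h j) to calculate $\hat{\theta}_{(hj)}$,
and
$$\hat{V}_{JK}(\hat{\theta}) = \sum_{h=1}^{H} \frac{n_h - 1}{n_h} \sum_{j=1}^{n_h} (\hat{\theta}_{(hj)} - \hat{\theta})^2. \qquad (13.8)$$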
Example 13.7
Here we use the jackknife to calculate the variance of the mean egg volume from Example 5.6. We
calculated
In that example, since we did not know the
number of clutches in the population, we calculated the with-replacement variance.
First, find the weight vector for each of the 184 jackknife iterations. We have only one stratum, so h
= 1 for all observations. For $\hat{\theta}_{(1,1)}$, delete the first PSU. Thus, the new weights for the observations
in the first PSU are 0; the weights in all remaining PSUs are the previous weights times nh /(nh - 1) =
184/183. Using the weights from Example 5.8, the new jackknife weight columns are shown in
Table 13.5.
Table 13.5: Jackknife Weights, For Example 13.7

clutch   csize   relweight   w(1,1)      w(1,2)      ...   w(1,184)
1        13      6.5         0           6.535519    ...   6.535519
1        13      6.5         0           6.535519    ...   6.535519
2        13      6.5         6.535519    0           ...   6.535519
2        13      6.5         6.535519    0           ...   6.535519
3        6       3           3.016393    3.016393    ...   3.016393
3        6       3           3.016393    3.016393    ...   3.016393
4        11      5.5         5.530055    5.530055    ...   5.530055
4        11      5.5         5.530055    5.530055    ...   5.530055
.        .       .           .           .           ...   .
.        .       .           .           .           ...   .
.        .       .           .           .           ...   .
183      13      6.5         6.535519    6.535519    ...   6.535519
183      13      6.5         6.535519    6.535519    ...   6.535519
184      12      6           6.032787    6.032787    ...   0
184      12      6           6.032787    6.032787    ...   0
Sum      3514    1757        1753.53     1753.53     ...   1754.54
Note that the sums of the jackknife weights vary from column to column because the original sample
is not self-weighting. We calculated $\hat{\theta}$ as $\sum w_i y_i / \sum w_i$; to find $\hat{\theta}_{(hj)}$, follow the same
procedure but use wi(h j) in place of wi.
Using (13.8) then, we calculate $\hat{V}_{JK}(\hat{\theta}) \approx 0.0037$. This results in a standard error of 0.061, the
same as calculated in Example 5.6.
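The weight construction and the stratified jackknife estimator in (13.8) can be sketched in Python as below. The record layout, function names, and the small single-stratum data set are hypothetical; with one observation per PSU and equal weights, the result should reduce to the with-replacement variance of a sample mean, s^2/n.

```python
from collections import defaultdict

def jackknife_weights(records, h, j, n_h):
    """Weight column w_i(hj): PSU j of stratum h is deleted."""
    out = []
    for rec in records:
        if rec["stratum"] != h:
            out.append(rec["w"])                      # other strata: unchanged
        elif rec["psu"] == j:
            out.append(0.0)                           # deleted PSU
        else:
            out.append(rec["w"] * n_h / (n_h - 1))    # same stratum: inflate
    return out

def weighted_mean(records, weights):
    return sum(w * r["y"] for r, w in zip(records, weights)) / sum(weights)

def stratified_jackknife_variance(records, stat=weighted_mean):
    psus = defaultdict(set)
    for rec in records:
        psus[rec["stratum"]].add(rec["psu"])
    theta = stat(records, [r["w"] for r in records])
    v = 0.0
    for h, psu_ids in psus.items():
        n_h = len(psu_ids)
        ss = sum((stat(records, jackknife_weights(records, h, j, n_h)) - theta) ** 2
                 for j in psu_ids)
        v += (n_h - 1) / n_h * ss
    return theta, v

# Hypothetical single-stratum sample: five PSUs with one observation each.
records = [{"stratum": 1, "psu": j, "w": 2.0, "y": float(j)} for j in range(1, 6)]
theta, v_jk = stratified_jackknife_variance(records)
print(theta, v_jk)
```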
Advantages This is an all-purpose method. The same procedure is used to estimate the variance
for every statistic for which the jackknife can be used. The jackknife works in stratified multistage
samples in which BRR does not apply because more than two PSUs are sampled in each stratum.
The jackknife provides a consistent estimator of the variance when $\theta$ is a smooth function of
population totals (Krewski and Rao 1981).
Disadvantages The jackknife performs poorly for estimating the variances of some statistics. For
example, the jackknife produces a poor estimate of the variance of quantiles in an SRS. Little is
known about how the jackknife performs in unequal-probability, without-replacement sampling
designs in general.
13.3.3. The Bootstrap
As with the jackknife, theoretical results for the bootstrap were developed for areas of statistics
other than survey sampling; Shao and Tu (1995) summarize theoretical results for the bootstrap in
complex survey samples. We first describe the bootstrap for an SRS with replacement, as developed
by Efron (1979, 1982) and described in Efron and Tibshirani (1993). Suppose S is an SRS of size n.
We hope, in drawing the sample, that it reproduces properties of the whole population. We then treat
the sample S as if it were a population and take resamples from S. If the sample really is similar to
the population–if the empirical probability mass function (epmf) of the sample is similar to the
probability mass function of the population–then samples generated from the epmf should behave
like samples taken from the population.
Example 13.8
Let's use the bootstrap to estimate the variance of the median height, $\theta$, in the height population
from Example 7.3, using the sample in the file ht.srs. The population median height is $\theta$ = 168; the
sample median from ht.srs is
Figure 7.2, the probability mass function for the population,
and Figure 7.3, the histogram of the sample, are similar in shape (largely because the sample size for
the SRS is large), so we would expect that taking an SRS of size n with replacement from S would
be like taking an SRS with replacement from the population. A resample from S, though, will not be
exactly the same as S because the resample is with replacement–some observations in S may occur
twice or more in the resample, while other observations in S may not occur at all.
We take an SRS of size 200 with replacement from S to form the first resample. The first resample
from S has an epmf similar to but not identical to that of S; the resample median
Repeating the process, the second resample from S has median
We take a total of R =
2000 resamples from S and calculate the sample median from each sample,
obtaining
We obtain the following frequency table for the 2000 sample medians:
Median of Resample   Frequency
165                  1
166                  5
166.5                2
167                  40
167.5                15
168                  268
168.5                87
169                  739
169.5                111
170                  491
170.5                44
171                  188
171.5                5
172                  4

The sample mean of these 2000 values is 169.3, and the sample variance of these 2000 values is
0.9148; this is the bootstrap estimator of the variance. The bootstrap distribution may be used to
calculate a confidence interval directly: since it estimates the sampling distribution of $\hat{\theta}$, a 95% CI is
calculated by finding the 2.5 percentile and the 97.5 percentile of the bootstrap distribution. For this
distribution, a 95% CI for the median is [167.5, 171].
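The summary statistics quoted for the 2000 resample medians can be reproduced from the frequency table with a short Python sketch. The resampling function is shown on a small hypothetical sample because the file ht.srs is not available here; the function names and the rank-based percentile rule are illustrative choices.

```python
import random
from math import ceil
from statistics import mean, median, variance

def bootstrap_medians(sample, R, seed=0):
    """Draw R with-replacement resamples of size n and return their medians."""
    rng = random.Random(seed)
    n = len(sample)
    return [median(rng.choices(sample, k=n)) for _ in range(R)]

def percentile_ci(values, level=0.95):
    """Percentile interval: the ordered values at ranks ceil(p * R)."""
    ordered = sorted(values)
    R = len(ordered)
    lower = ordered[ceil((1 - level) / 2 * R) - 1]
    upper = ordered[ceil((1 + level) / 2 * R) - 1]
    return lower, upper

# Frequency table of the 2000 resample medians reported in the example.
table = [(165, 1), (166, 5), (166.5, 2), (167, 40), (167.5, 15), (168, 268),
         (168.5, 87), (169, 739), (169.5, 111), (170, 491), (170.5, 44),
         (171, 188), (171.5, 5), (172, 4)]
medians = [m for m, freq in table for _ in range(freq)]

boot_mean = mean(medians)
boot_var = variance(medians)        # bootstrap estimator of the variance
ci = percentile_ci(medians)
print(round(boot_mean, 1), round(boot_var, 4), ci)

# Resampling itself, on a small hypothetical sample of heights.
heights = [160, 165, 168, 170, 172, 175, 169, 171]
reps = bootstrap_medians(heights, R=200)
```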
If the original SRS is without replacement, Gross (1980) proposes creating N/n copies of the sample
to form a “pseudopopulation,” then drawing R SRSs without replacement from the
pseudopopulation. If n/N is small, the with-replacement and without-replacement bootstrap
distributions should be similar.
Sitter (1992) describes and compares three bootstrap methods for complex surveys. In all these
methods, bootstrapping is applied within each stratum. Here are steps for using one version of the
rescaling bootstrap of Rao and Wu (1988) for a stratified random sample:
1.
For each stratum, draw an SRS of size (nh - 1) with replacement from the sample in stratum
h. Do this independently for each stratum.
2.
For each resample r (r = 1, 2, . . . , R), create a new weight variable
$$w_i(r) = w_i \times \frac{n_h}{n_h - 1} m_i(r),$$
where $m_i(r)$ is the number of times that observation i is selected to be in the resample.
Calculate $\hat{\theta}^*_r$ using the weights $w_i(r)$.
3.
Repeat steps 1 and 2 R times, for R a large number.
4.
Calculate
$$\hat{V}_B(\hat{\theta}) = \frac{1}{R-1} \sum_{r=1}^{R} (\hat{\theta}^*_r - \hat{\theta})^2.$$
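The four steps above can be sketched in Python for a stratified random sample. This is a minimal illustration under stated assumptions: each stratum is given as a list of hypothetical (weight, y) pairs, every stratum has at least two sampled units, the statistic is a weighted mean, and the seed and R are arbitrary.

```python
import random

def rescaled_weights(stratum, rng):
    """Steps 1 and 2 for one stratum: resample n_h - 1 units with replacement
    and return the rescaled weights w_i * n_h / (n_h - 1) * m_i."""
    n_h = len(stratum)
    counts = [0] * n_h
    for _ in range(n_h - 1):
        counts[rng.randrange(n_h)] += 1
    return [w * n_h / (n_h - 1) * m for (w, _), m in zip(stratum, counts)]

def weighted_mean(ys, weights):
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)

def rescaling_bootstrap_variance(strata, R=500, seed=1):
    rng = random.Random(seed)
    ys = [y for stratum in strata for _, y in stratum]
    theta = weighted_mean(ys, [w for stratum in strata for w, _ in stratum])
    reps = []
    for _ in range(R):                                  # step 3
        weights = []
        for stratum in strata:                          # independently per stratum
            weights.extend(rescaled_weights(stratum, rng))
        reps.append(weighted_mean(ys, weights))
    v_b = sum((t - theta) ** 2 for t in reps) / (R - 1)  # step 4
    return theta, v_b

# Hypothetical stratified sample: two strata with equal weights inside each stratum.
strata = [
    [(5.0, 10.0), (5.0, 12.0), (5.0, 9.0), (5.0, 14.0)],
    [(8.0, 20.0), (8.0, 18.0), (8.0, 25.0)],
]
theta, v_b = rescaling_bootstrap_variance(strata)
print(theta, v_b)
```

When the weights are equal within a stratum, the rescaling factor keeps the total weight of each stratum unchanged in every resample.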
Advantages The bootstrap will work for nonsmooth functions (such as quantiles) in general
sampling designs. The bootstrap is well suited for finding confidence intervals directly: to get a 90%
CI, merely take the 5th and 95th percentiles from $\hat{\theta}^*_1, \hat{\theta}^*_2, \ldots, \hat{\theta}^*_R$, or use a bootstrap-t method such as
that described in Efron (1982).
Disadvantages The bootstrap requires more computations than BRR or jackknife since R is
typically a very large number. Compared with BRR and jackknife, less theoretical work has been
done on properties of the bootstrap in complex sampling designs.
13.4
Generalized Variance Functions
In many large government surveys such as the U. S. Current Population Survey (CPS) or the
Canadian Labour Force Survey, hundreds or thousands of estimates are calculated and published.
The agencies analyzing the survey results could calculate standard errors for each published estimate
and publish additional tables of the standard errors but that would add greatly to the labor involved in
publishing timely estimates from the surveys. In addition, other analysts of the public-use tapes may
wish to calculate additional estimates, and the public-use tapes may not provide enough information
to allow calculation of standard errors.
Generalized variance functions (GVFs) are provided in a number of surveys to calculate standard
errors. They have been used for the CPS since 1947. Here, we describe some GVFs in the 1990
NCVS.
Criminal Victimization in the United States, 1990 (U. S. Department of Justice 1992, 146) gives
GVF formulas for calculating standard errors. If $\hat{T}$ is an estimated number of persons or households
victimized by a particular type of crime or if $\hat{T}$ estimates a total number of victimization incidents,
$$\hat{V}(\hat{T}) = a\hat{T}^2 + b\hat{T}. \qquad (13.9)$$
If $\hat{p}$ is an estimated proportion,
$$\hat{V}(\hat{p}) = \frac{b}{\hat{z}} \hat{p}(1 - \hat{p}), \qquad (13.10)$$
where $\hat{z}$ is the estimated base population for the proportion. For the 1990 NCVS, the values of a
and b were a = -.00001833 and b = 3725. For example, it was estimated that 1.23% of persons aged
20 to 24 were robbed in 1990 and that 18,017,100 persons were in that age group. Thus, the GVF
estimate of SE($\hat{p}$) is
$$\sqrt{\frac{3725}{18,017,100}(.0123)(1 - .0123)} = .0016.$$
Assuming that asymptotic results apply, this gives an approximate 95% CI of .0123 ± (1.96)(.0016),
or [.0091, .0153].
There were an estimated 800,510 completed robberies in 1990. Using (13.9), the standard error of
this estimate is
$$\sqrt{(-.00001833)(800,510)^2 + 3725(800,510)} = 54,499.$$
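The two GVF calculations above are easy to check in Python. The constants a and b are the 1990 NCVS values quoted in the text; the function names are illustrative.

```python
from math import sqrt

A_NCVS = -0.00001833   # a for the 1990 NCVS, as quoted in the text
B_NCVS = 3725.0        # b for the 1990 NCVS, as quoted in the text

def gvf_se_total(t_hat, a=A_NCVS, b=B_NCVS):
    """Standard error of an estimated total from (13.9): sqrt(a*t^2 + b*t)."""
    return sqrt(a * t_hat ** 2 + b * t_hat)

def gvf_se_proportion(p_hat, base, b=B_NCVS):
    """Standard error of an estimated proportion from (13.10)."""
    return sqrt(b / base * p_hat * (1.0 - p_hat))

se_p = gvf_se_proportion(0.0123, 18_017_100)   # robbery rate, persons aged 20 to 24
se_t = gvf_se_total(800_510)                   # completed robberies
print(round(se_p, 4), round(se_t))
```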
Where do these formulas come from? Suppose Ti is the total number of observation units
belonging to a class, say, the total number of persons in the United States who were victims of
violent crime in 1990. Let Pi = Ti/N, the proportion of persons in the population belonging to that
class. If di is the design effect (deff) in the survey for estimating Pi (see Section 7.5), then
$$V(\hat{P}_i) \approx d_i \frac{P_i(1 - P_i)}{n} = \frac{b_i}{N} P_i(1 - P_i), \qquad (13.11)$$
where bi = di x (N / n). Similarly,
$$V(\hat{T}_i) \approx d_i N^2 \frac{P_i(1 - P_i)}{n} = a_i T_i^2 + b_i T_i,$$
where ai = -di / n. If estimating a proportion in a domain (say, the proportion of persons in the 20-24
age group who were robbery victims), the denominator in (13.11) is changed to the estimated
population size of the domain (see Section 3.3).
If the deff's are similar for different estimates so that $a_i \approx a$ and $b_i \approx b$, then constants a and b can
be estimated that give (13.9) and (13.10) as approximations to the variance for a number of
quantities. The general procedure for constructing a generalized variance function is as follows:
1.
Using replication or some other method, estimate variances for k population totals of special
interest, $\hat{T}_1, \hat{T}_2, \ldots, \hat{T}_k$. Let vi be the relative variance for $\hat{T}_i$,
$v_i = \hat{V}(\hat{T}_i) / \hat{T}_i^2$, for i = 1, 2, . . . , k.
2.
Postulate a model relating vi to $\hat{T}_i$. Many surveys use the model
$$v_i = \alpha + \frac{\beta}{\hat{T}_i}.$$
This is a linear regression model with response variable vi and explanatory variable $1/\hat{T}_i$.
Valliant (1987) found that this model produces consistent estimates of the variances for the
class of superpopulation models he studied.
3.
Use regression techniques to estimate $\alpha$ and $\beta$. Valliant (1987) suggests using weighted least
squares to estimate the parameters, giving higher weight to items with small vi. The GVF
estimate of variance, then, is obtained from the predicted value of the regression equation,
$\hat{v} = a + b/\hat{T}$, which gives $\hat{V}(\hat{T}) = a\hat{T}^2 + b\hat{T}$ as in (13.9).
The $a_i$ and $b_i$ for individual items are replaced by quantities a and b, which are calculated from all k
items. For the 1990 NCVS, b = 3725. Most weights in the 1990 NCVS are between 1500 and 2500;
b approximately equals (average weight) × (deff) if the overall design effect is about 2.
Valliant (1987) found that if deff’s for the k estimated totals are similar, the GVF variances were
often more stable than the direct estimate, as they smooth out some of the fluctuations from item to
item. If a quantity of interest does not follow the model in step 2, however, the GVF estimate of the
variance is likely to be poor, and you can only know that it is poor by calculating the variance
directly.
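The three-step procedure above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method used by any particular survey: the function names are hypothetical, the default weights 1/v_i^2 are just one way of "giving higher weight to items with small v_i", and the totals are made up, with relative variances generated exactly from the model so that the fit can be checked.

```python
def fit_gvf(totals, rel_vars, weights=None):
    """Fit v_i = alpha + beta / t_i by weighted least squares; return (a, b)."""
    x = [1.0 / t for t in totals]
    if weights is None:
        # Assumed weighting: 1 / v_i^2, so items with small v_i count more.
        weights = [1.0 / v ** 2 for v in rel_vars]
    sw = sum(weights)
    x_bar = sum(w * xi for w, xi in zip(weights, x)) / sw
    v_bar = sum(w * vi for w, vi in zip(weights, rel_vars)) / sw
    sxx = sum(w * (xi - x_bar) ** 2 for w, xi in zip(weights, x))
    sxv = sum(w * (xi - x_bar) * (vi - v_bar)
              for w, xi, vi in zip(weights, x, rel_vars))
    b = sxv / sxx
    a = v_bar - b * x_bar
    return a, b

def gvf_variance(t_hat, a, b):
    """Predicted variance of a total: t^2 * (a + b/t) = a*t^2 + b*t."""
    return a * t_hat ** 2 + b * t_hat

# Made-up totals; relative variances follow the model exactly.
totals = [50_000, 200_000, 800_000, 3_000_000, 12_000_000]
rel_vars = [-0.00001833 + 3725.0 / t for t in totals]
a, b = fit_gvf(totals, rel_vars)
print(a, b)  # close to -0.00001833 and 3725
```

With real data the points will scatter around the fitted line, and a plot of v_i against 1/t_i is a quick check of whether an item of interest follows the model.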
Advantages. The GVF may be used when insufficient information is provided on the public-use
tapes to allow direct calculation of standard errors. The data collector can calculate the GVF, and the
data collector often has more information for estimating variances than is released to the public. A
generalized variance function saves a great deal of time and speeds production of annual reports. It
is also useful for designing similar surveys in the future.
Disadvantages. The model relating $v_i$ to $\hat{t}_i$ may not be appropriate for the quantity you are
interested in, resulting in an unreliable estimate of the variance. You must be careful about using
GVFs for estimates not included when calculating the regression parameters. If a subpopulation has
an unusually high degree of clustering (and hence a high deff), the GVF estimate of the variance may
be much too small.
13.5
Confidence Intervals
13.5.1. Confidence Intervals for Smooth Functions of Population Totals
Theoretical results exist for most of the variance estimation methods discussed in this chapter,
stating that under certain assumptions $(\hat{\theta} - \theta)/\sqrt{\hat{V}(\hat{\theta})}$ asymptotically follows a standard normal
distribution. These results and conditions are given in Binder (1983), for linearization estimates; in
Krewski and Rao (1981) and Rao and Wu (1985), for jackknife and BRR; in Rao and Wu (1988) and
Sitter (1992), for bootstrap. Consequently, when the assumptions are met, an approximate 95%
confidence interval for $\theta$ may be constructed as
$$\hat{\theta} \pm 1.96\sqrt{\hat{V}(\hat{\theta})}.$$
Alternatively, a $t_{df}$ percentile may be substituted for 1.96, with df = (number of groups - 1) for the
random group method. Rust and Rao (1996) give guidelines for appropriate df's for other methods.
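As a minimal sketch of this interval, the Python function below takes an estimate and its estimated variance and returns the normal-theory limits. The function name and the numbers are made up for illustration. The `critical` argument lets a t percentile, looked up from a table, replace the normal percentile.

```python
from statistics import NormalDist

def confidence_interval(estimate, variance, level=0.95, critical=None):
    """Return (lower, upper) = estimate -/+ critical * sqrt(variance)."""
    if critical is None:
        # Normal percentile, about 1.96 for a 95% interval.
        critical = NormalDist().inv_cdf(0.5 + level / 2.0)
    half_width = critical * variance ** 0.5
    return estimate - half_width, estimate + half_width

# Made-up estimate 50 with estimated variance 4 (standard error 2).
lower, upper = confidence_interval(50.0, 4.0)
# Same estimate with a t percentile for 9 df (10 random groups), taken from a t table.
lower_t, upper_t = confidence_interval(50.0, 4.0, critical=2.262)
print(lower, upper, lower_t, upper_t)
```

The t-based interval is wider, reflecting the extra uncertainty when the variance is estimated from only a few groups.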
Roughly speaking, the assumptions for linearization, jackknife, BRR, and bootstrap are as follows:
1. The quantity of interest $\theta$ can be expressed as a smooth function of the population totals;
more precisely, $\theta = h(T_1, T_2, \ldots, T_k)$, where the second-order partial derivatives of h are
continuous.
2. The sample sizes are large: either the number of PSUs sampled in each stratum is large, or
the survey contains a large number of strata. (See Rao and Wu 1985 for the precise technical
conditions needed.) Also, to construct a confidence interval using the normal distribution,
the sample sizes must be large enough so that the sampling distribution of $\hat{\theta}$ is approximately
normal.
Furthermore, a number of simulation studies indicate that these confidence intervals behave well in
practice. Wolter (1985) summarizes some of the simulation studies; others are found in Kovar et al.
(1988) and Rao et al. (1992). These studies indicate that the jackknife and linearization methods
tend to give similar estimates of the variance, while the bootstrap and BRR procedures give slightly
larger estimates. Sometimes a transformation may be used so that the sampling distribution of a
statistic is closer to a normal distribution: if estimating total income, for example, a log
transformation may be used because the distribution of income is extremely skewed.
13.5.2. Confidence Intervals for Population Quantiles
The theoretical results described above for BRR, jackknife, bootstrap, and linearization do not apply
to population quantiles, however, because they are not smooth functions of population totals.
Special methods have been developed to construct confidence intervals for quantiles; McCarthy
(1993) compares several confidence intervals for the median, and his discussion applies to other
quantiles as well.
Let q be between 0 and 1. Then define the quantile $\theta_q$ as $\theta_q = F^{-1}(q)$, where $F^{-1}(q)$ is defined to be the
smallest value y satisfying $F(y) \ge q$. Similarly, define $\hat{\theta}_q = \hat{F}^{-1}(q)$. Now $F^{-1}$ and $\hat{F}^{-1}$ are not
smooth functions, but we assume the population and sample are large enough so that they can be
well approximated by continuous functions.
Some of the methods already discussed work quite well for constructing confidence intervals for
quantiles. The random group method works well if the number of random groups, R, is moderate.
Let $\hat{\theta}_q(r)$ be the estimated quantile from random group r. Then, a confidence interval for $\theta_q$ is
$$\hat{\theta}_q \pm t\sqrt{\frac{1}{R(R-1)}\sum_{r=1}^{R}\left[\hat{\theta}_q(r) - \hat{\theta}_q\right]^2},$$
where t is the appropriate percentile from a t distribution with R - 1 df. Similarly, empirical studies
by McCarthy (1993), Kovar et al. (1988), Sitter (1992), and Rao et al. (1992) indicate that in certain
designs confidence intervals can be formed using
$$\hat{\theta}_q \pm 1.96\sqrt{\hat{V}(\hat{\theta}_q)},$$
where the variance estimate is calculated using BRR or bootstrap.
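A random group interval for a quantile might be computed as in the Python sketch below. It is illustrative only: the function names are hypothetical, the data are made up, the deviations are taken around the full-sample quantile (one common form of the random group estimator), and the t percentile is supplied by the user from a table.

```python
import math

def weighted_quantile(values, weights, q):
    """Smallest y whose estimated distribution function reaches q."""
    total = sum(weights)
    cumulative = 0.0
    for y, w in sorted(zip(values, weights)):
        cumulative += w
        if cumulative / total >= q:
            return y
    return max(values)

def random_group_quantile_ci(groups, q, t_crit):
    """groups: list of (values, weights) pairs, one pair per random group."""
    all_values = [y for values, _ in groups for y in values]
    all_weights = [w for _, weights in groups for w in weights]
    theta_hat = weighted_quantile(all_values, all_weights, q)
    group_estimates = [weighted_quantile(v, w, q) for v, w in groups]
    r = len(groups)
    variance = sum((g - theta_hat) ** 2 for g in group_estimates) / (r * (r - 1))
    half_width = t_crit * math.sqrt(variance)
    return theta_hat - half_width, theta_hat + half_width

# Three made-up random groups with equal weights; 4.303 is the t percentile for 2 df.
groups = [([1, 2, 3, 4, 5], [1.0] * 5),
          ([2, 3, 4, 5, 6], [1.0] * 5),
          ([3, 4, 5, 6, 7], [1.0] * 5)]
lower, upper = random_group_quantile_ci(groups, 0.5, t_crit=4.303)
print(lower, upper)
```

With so few groups the t percentile is large and the interval is wide, which is why a moderate number of random groups is needed in practice.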
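An alternative interval can be constructed based on a method introduced by Woodruff (1952). For
any y, $\hat{F}(y)$ is a function of population totals:
$$\hat{F}(y) = \frac{\sum_{i \in S} w_i u_i}{\sum_{i \in S} w_i},$$
where $w_i$ is the sampling weight of unit i, $u_i = 1$ if $y_i \le y$, and $u_i = 0$ if $y_i > y$. Thus, a method in this chapter can be used to estimate
$V[\hat{F}(y)]$ for any value y, and an approximate 95% CI for F(y) is given by
$$\hat{F}(y) \pm 1.96\sqrt{\hat{V}[\hat{F}(y)]}.$$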
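Now let's use the confidence interval for $q = F(\theta_q)$ to obtain an approximate confidence interval for $\theta_q$.
Since we have a 95% CI,
$$0.95 \approx P\left\{\hat{F}(\theta_q) - 1.96\sqrt{\hat{V}[\hat{F}(\theta_q)]} \le q \le \hat{F}(\theta_q) + 1.96\sqrt{\hat{V}[\hat{F}(\theta_q)]}\right\}$$
$$= P\left\{q - 1.96\sqrt{\hat{V}[\hat{F}(\theta_q)]} \le \hat{F}(\theta_q) \le q + 1.96\sqrt{\hat{V}[\hat{F}(\theta_q)]}\right\}$$
$$= P\left\{\hat{F}^{-1}\left(q - 1.96\sqrt{\hat{V}[\hat{F}(\theta_q)]}\right) \le \theta_q \le \hat{F}^{-1}\left(q + 1.96\sqrt{\hat{V}[\hat{F}(\theta_q)]}\right)\right\}.$$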
Figure 13.2. Woodruff's confidence interval for the quantile $\theta_q$ if the empirical distribution function is
continuous. Since F(y) is a proportion, we can easily calculate a confidence interval (CI) for any
value of y, shown on the vertical axis. We then look at the corresponding points on the horizontal
axis to form a confidence interval for $\theta_q$.
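So an approximate 95% CI for the quantile $\theta_q$ is
$$\left[\hat{F}^{-1}\left\{q - 1.96\sqrt{\hat{V}[\hat{F}(\hat{\theta}_q)]}\right\},\ \hat{F}^{-1}\left\{q + 1.96\sqrt{\hat{V}[\hat{F}(\hat{\theta}_q)]}\right\}\right].$$
The derivation of this confidence interval is illustrated in Figure 13.2.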
Now we need several technical assumptions to use the Woodruff-method interval. These
assumptions are stated by Rao and Wu (1987) and Francisco and Fuller (1991), who studied a similar
confidence interval. Basically, the problem is that both F and $\hat{F}$ are step functions; they have jumps
at the values of y in the population and sample. The technical conditions basically say that the jumps
in F and in $\hat{F}$ should be small and that the sampling distribution of $\hat{F}(y)$ is approximately normal.
Example 13.9
Let’s use Woodruff’s method to construct a 95% CI for the median height in the file ht.srs, discussed
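in Examples 7.3 and 13.8. Note that $\hat{F}(\theta_q)$ is the sample proportion of observations in the SRS that
take on value at most $\theta_q$; so, ignoring the fpc,
$$\hat{V}[\hat{F}(\hat{\theta}_q)] = \frac{q(1 - q)}{n}.$$
Thus, for this sample (n = 200),
$$1.96\sqrt{\hat{V}[\hat{F}(\hat{\theta}_{.5})]} = 1.96\sqrt{\frac{(.5)(.5)}{200}} = .0693.$$
The lower confidence bound for the median is then $\hat{F}^{-1}(.5 - .0693)$ and the upper confidence
bound for the median is $\hat{F}^{-1}(.5 + .0693)$.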
As heights were only measured to the nearest centimeter,
we'll use linear interpolation to smooth the step function $\hat{F}$.
The following values were obtained for the empirical distribution function:
y              167     168     170     171     172
$\hat{F}(y)$   0.405   0.440   0.515   0.550   0.605
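Then, interpolating,
$$\hat{F}^{-1}(.4307) = 167 + \frac{.4307 - .405}{.440 - .405}\,(168 - 167) = 167.7$$
and
$$\hat{F}^{-1}(.5693) = 171 + \frac{.5693 - .550}{.605 - .550}\,(172 - 171) = 171.4.$$
Thus, an approximate 95% CI for the median is [167.7, 171.4].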
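The calculation in Example 13.9 can be reproduced with the short Python sketch below. The function names are hypothetical, the tabulated values are those given in the example, and the sample size n = 200 is an assumption about the file ht.srs rather than something stated in this section.

```python
import math

def inverse_ecdf(p, ys, fs):
    """Linearly interpolate the tabulated F-hat to find y with F-hat(y) = p."""
    points = list(zip(ys, fs))
    for (y0, f0), (y1, f1) in zip(points, points[1:]):
        if f0 <= p <= f1:
            return y0 + (p - f0) / (f1 - f0) * (y1 - y0)
    raise ValueError("p lies outside the tabulated range")

def woodruff_ci_srs(q, n, ys, fs, z=1.96):
    """Woodruff interval for the q-th quantile under SRS, ignoring the fpc."""
    half_width = z * math.sqrt(q * (1.0 - q) / n)
    return inverse_ecdf(q - half_width, ys, fs), inverse_ecdf(q + half_width, ys, fs)

ys = [167, 168, 170, 171, 172]
fs = [0.405, 0.440, 0.515, 0.550, 0.605]
lower, upper = woodruff_ci_srs(0.5, 200, ys, fs)  # n = 200 is assumed
print(round(lower, 1), round(upper, 1))           # prints 167.7 171.4
```

For a complex design, the only change would be to replace q(1 - q)/n with a design-based variance estimate of the estimated distribution function.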
13.5.3. Conditional Confidence Intervals
The confidence intervals presented so far in this chapter have been developed under the design-based
approach. A 95% CI may be interpreted in the repeated-sampling sense that, if samples were
repeatedly taken from the finite population, we would expect 95% of the resulting confidence
intervals to include the true value of the quantity in the population.
Sometimes, especially in situations when ratio estimation or poststratification is used, you may
want to consider constructing a conditional confidence interval instead. In poststratification as used
for nonresponse (Section 8.5.2), a variance conditional on the respondent sample sizes $n_{hR}$ was
presented in (8.3). A 95% conditional
confidence interval, constructed using the variance in (8.3), would have the interpretation that we
would expect 95% of all samples having those specific values of $n_{hR}$ to yield confidence intervals
containing the true population value.
The theory of conditional confidence intervals is beyond the scope of this book; we refer the reader
to Särndal et al. (1992, sec. 7.10), Casady and Valliant (1993), and Thompson (1997, sec. 5.12) for
more discussion and bibliography.
13.6
Summary and Software
This chapter has briefly introduced you to some basic types of variance estimation methods that are
used in practice: linearization, random groups, replication, and generalized variance functions. But
this is just an introduction; you are encouraged to read some of the references mentioned in this
chapter before applying these methods to your own complex survey. Much of the research done
exploring properties and behavior of these methods has been done since 1980, and variance
estimation methods are still a subject of research by statisticians.
Linearization methods are perhaps the most thoroughly researched in terms of theoretical properties
and have been widely used to find variance estimates in complex surveys. The main drawback of
linearization, though, is that the derivatives need to be calculated for each statistic of interest, and
this complicates the programs for estimating variances. If the statistic you are interested in is not
handled in the software, you must write your own code.
The random group method is an intuitively appealing method for estimating variances. Easy to
explain and to compute, it can be used for almost any statistic of interest. Its main drawback is that
we generally need enough random groups to have a stable estimate of the variance, and the number
of random groups we can form is limited by the number of PSUs sampled in a stratum.
Resampling methods for stratified multistage surveys avoid partial derivatives by computing
estimates for subsamples of the complete sample. They must be constructed carefully, however, so
that the correlation of observations in the same cluster is preserved in the resampling. Resampling
methods require more computing time than linearization but less programming time: the same
method is used on all statistics. They have been shown to be equivalent to linearization for large
samples when the characteristic of interest is a smooth function of population totals.
The BRR method can be used with almost any statistic, but it is usually used only for two-PSU-per-stratum
designs or for designs that can be reformulated into two PSUs per stratum. The jackknife and
bootstrap can also be used for most estimators likely to be used in surveys (exception: the delete-1
jackknife may not work well for estimating the variance of quantiles) and may be used in stratified
multistage samples in which more than two PSUs are selected in each stratum, but they require more
computing than BRR.
Generalized variance functions are cheap and easy to use but have one major drawback: unless you
can calculate the variance using one of the other methods, you cannot be sure that your statistic
follows the model used to develop the GVF.
All methods except GVFs assume that information on the clustering is available to the data analyst.
In many surveys, such information is not released because it might lead to identification of the
respondents. See Dippo et al. (1984) for a discussion of this problem.
Various software packages have been developed to assist in analyzing data from complex surveys.
Cohen (1997), Lepkowski and Bowles (1996), and Carlson et al. (1993) evaluate PC-based packages
for analysis of complex survey data.1 SUDAAN (Shah et al. 1995), OSIRIS (Lepkowski 1982), Stata
(StataCorp 1996), and PC-CARP (Fuller et al. 1989) all use linearization methods to estimate
variances of nonlinear statistics. SUDAAN, for example, calculates variances of estimated
population totals for various stratified multistage sampling designs that have H strata,
unequal-probability cluster sampling with or without replacement at the first stage of sampling, and SRS with
or without replacement at subsequent stages. The formula in (6.9) is used to estimate the variance
for each stratum in with-replacement sampling, and the Sen-Yates-Grundy form in (6.15) is used for
without-replacement variance. Then, the variances for the totals in the strata are added to estimate
the variance for the estimated population total. SUDAAN then uses linearization to find variances
for ratios, regression coefficients, and other nonlinear statistics. Recent versions of SUDAAN also
implement BRR and jackknife.
1.
Lepkowski and Bowles (1996) tell how to access the free (or almost-free) software packages CENVAR,
CLUSTERS, Epi Info, VPLX, and WesVarPC through e-mail or from the internet. Software for analysis of survey
data is changing rapidly; the Survey Research Methods Section of the American Statistical Association
(www.amstat.org) is a good resource for updated information.
OSIRIS also implements BRR and jackknife methods. The survey software packages WesVarPC
(Brick et al. 1996; at press time, WesVarPC could be downloaded free from www.westat.com) and
VPLX (Fay 1990) both use resampling methods to calculate variance estimates. A simple S-PLUS
function for jackknife is given in Appendix D; this is not intended to substitute for well-tested
commercial software but to give you an idea of how these calculations might be done. Then, after
you understand the principles of the methods, you can use commercial software for your complex
surveys.
CHAPTER 14
SAMPLING FOR OBJECTIVE MEASUREMENT
SURVEYS IN AGRICULTURE
14.1
NEED FOR OBJECTIVE MEASUREMENTS
The principles of sampling discussed in the previous lectures are widely applicable to survey
programs generally. Certain kinds of surveys, however, may require special techniques of sampling
and data collection which are determined by the nature of the inquiry or the ability of respondents to
give accurate answers. Chapter 12 describes some special techniques used in agriculture surveys.
Statistics on area planted with individual crops and on yields from these crops are, in most countries,
based upon periodic reports from crop reporters. In some countries, these reporters are holders or
other individuals who reside in the rural areas and have knowledge of the local agriculture; they
report voluntarily, usually by mail. In other countries the reporters are government officials or
agents. The reports submitted by these agents are usually less accurate than those submitted by
private individuals, in part because the agents are usually reporting for a much larger area and in part
because the agents are not so closely connected with agriculture. However, whether made by private
individuals or by government agents, these reports are all subject to biases which are often large and
always difficult to evaluate. For example, investigations in various countries have shown that in
estimating yields, reporters (particularly official reporters) have a tendency to be biased toward the
normal; in other words, in good years they tend to underestimate the yield whereas in bad years they
tend to overestimate. Although private reporters also have this tendency to some extent, they are
generally more inclined to underestimate in the belief that it will be to their advantage to do so.
Areas, on the other hand, tend to be overestimated because of the difficulty of making proper
allowances for nonplanted areas around the edges of fields and areas within the fields that cannot be
planted.
Check data from past years can be used to evaluate the biases in the estimates of production obtained
from reporters. For crops such as tobacco or cotton, which must be processed before being used,
information on production can be obtained from the processors and compared with the corresponding
figures obtained from reporters. For other crops, similar use can be made of data obtained from
marketing or shipping sources. If such data are complete (usually there is no guarantee that they are
complete) and if the relative bias remains reasonably constant from year to year, estimates for the
current year can be adjusted on the basis of this past experience. For other crops, which are at least
partly consumed locally, fed to livestock, etc., such check data are not available. Census data, if
available, can be used as a benchmark for adjusting the reports for these crops. However, the census
data are also subject to reporting biases. Furthermore, adjustments using census data become less
and less reliable as the time lapse between the last census and the current year widens.
Experience in many different countries under a variety of conditions has indicated that subjective
methods of estimating production, even when other data are available for adjusting the estimates,
cannot provide reliable results. If accurate and unbiased estimates are required, the only alternative
is to establish some type of program utilizing objective methods of observation applied on a random
sampling basis. Such surveys are called "objective measurement surveys" because the data are
collected by actual observation and measurement or counting, rather than by methods depending on
the judgment, good memory, or education of persons who report the required information. Even
though such a program of objective measurement surveys is relatively costly and difficult to carry
out, the results will usually justify the effort.
14.2
DESIGNING THE SAMPLE
The theoretical considerations affecting sample design, discussed in previous lectures, are as relevant
to the design of an objective measurement survey as they are to any other survey.
14.2.1
Type of estimates required
The sampling statistician must know whether estimates are required for the nation as a whole, for the
Provinces or districts individually, or for some other administrative areas. The sample allocation
must be planned to give estimates for the desired areas at an acceptable level of reliability. If an
estimate of the number of holdings (either in total or for a specific crop) is also required, this must be
considered in designing the sample.
14.2.2
Stratification
First-level strata often consist of the smallest areas requiring separate estimates. Further gains in
efficiency may be obtained by further stratification into geographic areas having relatively
homogeneous yield rates for the crop. Other bases for stratification, such as irrigated and
nonirrigated land, varieties of crops, etc., may also be used.
14.2.3
Allocation to strata
The statistician must decide how to allocate the sample to strata. A common practice is to allocate it
proportionately to the area under the particular crop or group of crops being investigated. If
available, knowledge about the relative variances and/or the relative costs of performing the field
work in the different strata should also be used in allocating the sample.
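Proportionate allocation to strata can be sketched as follows. The function name, stratum names, and crop areas are made up for illustration, and the largest-remainder rounding is simply one convenient way to make the stratum sample sizes add up to n. The sketch does not use the variance or cost information mentioned above.

```python
def proportional_allocation(n, crop_area):
    """Allocate n sample units to strata in proportion to crop area."""
    total = sum(crop_area.values())
    exact = {h: n * area / total for h, area in crop_area.items()}
    alloc = {h: int(x) for h, x in exact.items()}  # round down first
    leftover = n - sum(alloc.values())
    # Give the remaining units to the strata with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda h: exact[h] - alloc[h], reverse=True)
    for h in by_remainder[:leftover]:
        alloc[h] += 1
    return alloc

# Made-up areas (hectares) under the crop in three strata.
areas = {"North": 12_000, "Central": 30_000, "South": 18_000}
alloc = proportional_allocation(100, areas)
print(alloc)  # prints {'North': 20, 'Central': 50, 'South': 30}
```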
14.2.4
Sampling within strata
A decision must be made on the method of sampling within strata. As was indicated before, there
are usually several possible sampling units and sample designs. In deciding upon a sampling plan,
the sampling statistician will need to know what materials are available for constructing the sampling
frame and what types of data are required. His choice may also be influenced by other factors such
as the availability of capable personnel to carry out the work. However, even with the restrictions
imposed by these considerations, there will usually be a number of possible choices.
14.2.4.1 Sample stages and types of sampling units
In most practical applications, several sampling stages and sampling units will be used within strata.
For example, if the strata are large administrative divisions, such as Provinces, a sample of districts
might be selected at the first stage and a sample of subdistricts within sample districts at the second
stage. Where "villages" have identifiable boundaries and account for all the land, they can serve as
convenient units at some stage in the sampling. The ultimate unit of analysis will usually be an
individual holding, the individual field, or (for studies involving estimation of yields) small plots
within fields. If the field is the unit of analysis, holdings may be selected at the preceding stage.
14.2.4.2 Methods of selecting holdings and fields
The following examples illustrate some procedures that can be used to select holdings and fields in
the final stages of the sample design. The selection of plots within fields is discussed in section
14.4.4 of this chapter.
(1)
Holdings can be selected from lists if lists are available or can be constructed without much
difficulty. Lists of holdings would be needed only for the units (villages, subdistricts, etc.)
actually selected in the sample at the preceding stage; if necessary, these could be compiled as
part of the field operation. The selection of holdings can be made either with equal probability
or with probability proportionate to size (assuming that information on size is available or can
be obtained). The measure of size might be total reported area in the holding, total area in a
particular crop or group of crops, etc.
Similarly, within each selected holding, a list of fields could be compiled and a sample
selected. Again, selection could be made either with equal probability or with probability
proportionate to size.
(2)
If maps or aerial photographs are available, these can be used to select fields directly without
first selecting holdings. One way to do this is to superimpose on the map or photo a grid on
which dots have been placed either in a systematic pattern or at random; each field into which a
dot falls is then included in the sample, thus giving the fields probabilities of selection
proportionate to their sizes. This procedure requires, of course, that the maps or photos be
sufficiently detailed so that the point and the corresponding field can be located on the ground.
(This procedure is not easily adaptable to estimating number of holdings, if that is desired.)
(3)
Area segments are useful sampling units for determining which holdings and/or fields are to be
included in the sample. These segments may be constructed either with natural boundaries that
can be located on the ground or with imaginary boundaries drawn on a photo or map; the
choice depends upon the particular situation. Holdings and/or fields may be associated with
area segments in any of the following ways:
(a) Area segments with imaginary boundaries could be used as first-stage sampling units and a
sample of segments selected; within the sample segments, fields could be selected as
second-stage units in the manner described above in (2).
(b) An alternative procedure would be to include in the sample all fields (or holdings) for
which a uniquely defined point falls within the segment boundaries. With this procedure,
fields (or holdings) would not be selected with probability proportionate to their sizes; the
probability of selection would be the same as the probability of selection of the segment
into which the point falls. This is known as an open segment approach. The segments
determine which units are included in the sample, but data are tabulated for some fields (or
holdings) lying partly outside the segment and are not tabulated for other fields (or
holdings) lying partly inside the segment.
The unique point must be defined with care. Usually a particular corner of the field
(holding) would be designated as the unique point. Because fields (holdings) may not be
rectangular, a specific rule for locating this corner would be needed as well. For example,
if the northwest corner were the designated unique point, it could be defined either (1) by
identifying the boundary points that lie farthest west and then designating the most
northern of these points as the northwest corner or (2) by identifying the boundary points
that lie farthest north and then designating the most western of these points as the
northwest corner. If the holding were the unit of analysis, the residence of the holder
(provided all such residences had a chance of being included in the sample) would
generally be preferred as the unique point since it would be the easiest point to locate. A
combination of rules is, perhaps, even more useful. For example, the residence of the
holder might be used when the holder lives on the holding, and a particular corner used
when he does not live on the holding. In any case, the point must be defined in a way such
that it is truly unique (that is, each unit must have one, and only one, such point associated
with it and thus have one, and only one, chance of being included in the sample); it should
also be fairly easy to identify.
(c) If the unit of analysis is the holding, the weighted segment approach will usually be more
efficient than the open segment approach. With this procedure, all holdings having any
land in the segment are included in the sample. In the estimation, the data from each
holding are weighted by a factor based on the proportion of the entire holding lying inside
the segments. In almost all applications, the weighted segment approach requires that the
segments have natural boundaries that can be identified on the ground.
(d) Still another possibility is to use the so-called closed-segment approach in which only
those fields or parts of fields lying within the segments are included in the sample. One
advantage of this procedure is that it avoids the difficulty of having to define the holding.
Of course, if information is desired on a holding basis, the closed-segment approach is not
appropriate since some holdings will certainly extend beyond the segment boundaries.
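The two rules for locating the northwest corner in the open segment approach, described in (b) above, can give different points for a field that is not rectangular. The Python sketch below illustrates this under simplifying assumptions: the field is a polygon given by its corner coordinates (east, north), the segment is a rectangle, and all names and numbers are made up.

```python
def northwest_corner_rule1(vertices):
    """Rule (1): among the farthest-west corner points, take the most northern."""
    min_x = min(x for x, _ in vertices)
    return max((p for p in vertices if p[0] == min_x), key=lambda p: p[1])

def northwest_corner_rule2(vertices):
    """Rule (2): among the farthest-north corner points, take the most western."""
    max_y = max(y for _, y in vertices)
    return min((p for p in vertices if p[1] == max_y), key=lambda p: p[0])

def in_segment(point, segment):
    """True if the point lies in a rectangular segment (west, south, east, north).

    Half-open limits put a point on a shared boundary in exactly one segment.
    """
    west, south, east, north = segment
    x, y = point
    return west <= x < east and south <= y < north

field = [(0, 0), (0, 3), (1, 5), (4, 5), (4, 0)]  # made-up five-sided field
segment = (0, 0, 2, 4)                            # made-up segment limits

p1 = northwest_corner_rule1(field)  # (0, 3)
p2 = northwest_corner_rule2(field)  # (1, 5)
print(p1, in_segment(p1, segment))  # prints (0, 3) True
print(p2, in_segment(p2, segment))  # prints (1, 5) False
```

Here the field would be in the sample under rule (1) but not under rule (2), which is why one rule must be fixed in advance and applied to every field.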
14.3
OBJECTIVE MEASUREMENT PROCEDURES FOR THE
ESTIMATION OF AREA
Since it is known that data on land area obtained by asking individuals to respond to questionnaires
can be very inaccurate, other means of obtaining these data have been investigated.1 The usual
approach in objective measurement surveys is to select a sample of areas, and then to go to these
areas and measure them directly. There are also methods of obtaining objective estimates of area
that do not require direct measurement of the land; for example, measuring the area on aerial
photographs. In addition to the measurements, other information may be obtained. For example, the
land may be classified into various categories according to its use (crop land, pasture, wasteland,
etc.), the particular crop being grown on each piece of land may be identified, etc.
1.
For discussion of techniques and experiences in many countries, see S. S. Zarkovich (ed.), Estimation of Areas in
Agricultural Statistics, Food and Agriculture Organization of the United Nations, Rome, 1965.
14.3.1
Measurement of land area
The first step in making direct measurements of land is to make a scale drawing. In order to do this,
one must be able to measure distances and angles. A drawing made by a professional land surveyor
using technical equipment would be very precise. On the other hand, a drawing made by an
inexperienced worker measuring distances by pacing and measuring angles by eye estimates would
not be very accurate. Between these extremes, there are many other methods that can be used. One
should balance the relative cost against the relative accuracy of the various procedures and select the
method that will provide an acceptable level of reliability for the lowest cost.
After the scale drawing has been made, the area of the drawing must be determined. If the land that
was measured is in the shape of a regular geometric figure such as a rectangle, trapezoid, etc., it is
relatively easy to determine the area of the drawing by standard mathematical formulas. Using the
appropriate expansion factor, the area of the land represented by the drawing can then be determined.
Often, however, the area is of irregular shape and other methods must be used; for example,
triangulation, planimetering, gridding, dot counting, and map cutting and weighing.
14.3.1.1 Triangulation.--In triangulation, the polygon formed by the drawing is converted into
simple triangles. It is a principle of geometry that this can always be done. (Curved
boundaries are roughly approximated by a series of straight lines before triangulation.)
Each triangle is measured and the area computed by standard formulas. This procedure is
time consuming and tedious and has largely been replaced.
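As an illustration of triangulation, the Python sketch below splits a polygon into a fan of triangles from its first corner and adds up their areas. The function names and the plot coordinates are made up. Signed triangle areas are used so that the sum is also correct for a simple polygon that is not convex.

```python
def triangle_area(p, q, r):
    """Signed area of triangle pqr (positive if the corners run counterclockwise)."""
    return 0.5 * ((q[0] - p[0]) * (r[1] - p[1]) - (r[0] - p[0]) * (q[1] - p[1]))

def polygon_area_by_triangulation(vertices):
    """Sum the signed areas of the triangles (v0, v_i, v_{i+1})."""
    v0 = vertices[0]
    total = sum(triangle_area(v0, vertices[i], vertices[i + 1])
                for i in range(1, len(vertices) - 1))
    return abs(total)

# Made-up L-shaped plot; corner coordinates in metres on the ground.
plot = [(0, 0), (40, 0), (40, 20), (20, 20), (20, 40), (0, 40)]
area_m2 = polygon_area_by_triangulation(plot)
print(area_m2, area_m2 / 10_000)  # prints 1200.0 0.12 (square metres, hectares)
```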
14.3.1.2 Planimetering.--A planimeter is an instrument with which one can determine the area of a
closed figure by tracing around the boundary of the figure with a pencil-like device. A
good planimeter will give very accurate results. It does, however, require a skilled
operator and much time.
14.3.1.3 Gridding.--Basically, a grid is a plane divided into small squares (for example, a piece of
ordinary graph paper). For use in measuring area, the squares are constructed so that each
is equivalent to a particular amount of area in accordance with the scale of the drawing. A
transparent plastic grid can be placed over the drawing; or the grid can be printed on paper
and the drawing made directly on this paper. To estimate the area represented by the
drawing, one counts the whole squares and parts of squares within the perimeter of the
scale drawing and converts this number to its equivalent in terms of the appropriate unit of
area.
Figure 1: MEASUREMENT BY GRIDDING
(1 SQUARE = 1/4 HECTARE)
Although not as accurate as planimetering, gridding can be done in less time. It requires only that
the individual be able to count accurately and that he be able to accurately convert the partial squares
into an equivalent number of whole squares. See Figure 1 for an illustration
of this method. There are approximately 159 squares within the scale drawing (including the partial
squares that overlap the boundary); thus, since each square represents 1/4 hectare, the field contains
about 159 × 1/4 = 39.75, or roughly 40, hectares.
14.3.1.4 Dot counting.--Dot counting is essentially the same as gridding except that instead of small
squares, the grid consists of uniformly spaced dots. Each dot represents a unit area
according to the scale of the drawing. One need only count the dots lying within the
perimeter of the drawing to find the area. If any dots lie on the boundary, only half of
them are counted.
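Dot counting can be sketched in Python as follows. Everything here is illustrative: the function names and the plot are made up, the dots are placed at the centers of the grid cells so that none falls on the boundary of this example, and a standard ray-casting test decides whether a dot lies inside the drawing. The helper `area_from_counts` applies the half-count rule for boundary dots to counts made by hand.

```python
import math

def point_in_polygon(x, y, vertices):
    """Ray-casting test: True if (x, y) lies inside the polygon."""
    inside = False
    j = len(vertices) - 1
    for i in range(len(vertices)):
        xi, yi = vertices[i]
        xj, yj = vertices[j]
        if (yi > y) != (yj > y):
            x_cross = xi + (y - yi) * (xj - xi) / (yj - yi)
            if x < x_cross:
                inside = not inside
        j = i
    return inside

def dot_count_area(vertices, spacing):
    """Estimate area by counting dots spaced `spacing` apart inside the polygon."""
    xs = [v[0] for v in vertices]
    ys = [v[1] for v in vertices]
    count = 0
    for kx in range(math.floor(min(xs) / spacing), math.ceil(max(xs) / spacing)):
        for ky in range(math.floor(min(ys) / spacing), math.ceil(max(ys) / spacing)):
            if point_in_polygon((kx + 0.5) * spacing, (ky + 0.5) * spacing, vertices):
                count += 1
    return count * spacing ** 2  # each dot represents one spacing-by-spacing cell

def area_from_counts(inside, on_boundary, area_per_dot):
    """Hand count: dots on the boundary are counted as one half."""
    return (inside + 0.5 * on_boundary) * area_per_dot

# Made-up L-shaped drawing, true area 12 square units; dots 0.5 units apart.
drawing = [(0, 0), (4, 0), (4, 2), (2, 2), (2, 4), (0, 4)]
print(dot_count_area(drawing, 0.5))   # prints 12.0
print(area_from_counts(40, 6, 0.25))  # prints 10.75
```

For an irregular field the dot count only approximates the area, and a finer dot spacing reduces the error at the cost of more counting.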
14.3.1.5 Map cutting and weighing.--By this procedure, the map or photograph of the area is
carefully cut into pieces representing different categories of land along the lines drawn by
the field worker. Each piece is then carefully weighed. The estimation is based on the
weight of the paper in each category relative to the weight for the entire area. This
procedure is not very practical; it is time consuming and requires a weighing instrument of
high precision and map paper of uniform quality.
14.3.2
Observation of land uses for a sample of points or lines
Some methods of objectively measuring area do not require direct measurement of the land itself.
Instead, the proportion of land falling into various categories is estimated by some objective means
and multiplied by the known total area of land in the universe (Province, district, etc.) to estimate the
total area in each category. All of the methods discussed in section 14.3.2 except the last method (the
last method described in paragraph 14.3.2.2) require accurate, up-to-date maps or aerial photographs;
consequently, their usefulness is somewhat limited at this time. However, as progress is made in
aerial photography, these and similar methods are likely to become more generally useful in the
future.
14.3.2.1 Observations for a sample of points.--A sample of points is selected and the points
marked on maps or aerial photographs. In selecting the sample of points, appropriate techniques of
stratification and clustering should be used to maximize the efficiency of the design. For example, if
primary interest is in the estimation of crop areas, higher sampling rates should be used in those
portions of the universe known to consist primarily of crop land.
If only broad categories of land use are to be estimated, and suitable aerial photographs are available,
it may be possible to make the necessary observations directly from the photographs. For most
purposes, however, it will be necessary to send observers to the field to locate each sample point and
to record the crop being grown or other use being made of the land at the point.
One author has suggested that for periodic surveys the sample points be permanently identified by
suitable markers, to make them easier to locate. The markers could not be placed at the exact
locations of the sample points, since they would interfere with farming operations; however, they
would be placed nearby and equipped with sighting devices aimed at the sample points. This method
has not yet been tried in the field. (Refer to "Fixed-Point Sampling--A New Method of Estimating
Crop Areas" by Thomas B. Jabine in Estadistica, published by the Inter-American Statistical
Institute, Washington, D.C., September-December 1967.)
Once the observations have been made for the sample of points, one can make an unbiased estimate
of area devoted to a particular use:
(1)
For each stratum in which points were sampled at a constant rate, tally the number of sample
points in each land use category.
(2)
Multiply the known total area of the stratum by the proportion of sample points devoted to that
use.
(3)
Sum over all strata.
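The three steps above can be sketched in a few lines. The strata, their known areas, and the observed land uses are hypothetical; the sketch assumes that points are sampled at a constant rate within each stratum, as step (1) requires.

```python
from collections import Counter

def estimate_area_by_use(strata):
    """Estimate total area per land use from a stratified sample of points.

    strata: list of (known_stratum_area, list_of_observed_uses), one
    observed use per sample point.
    """
    totals = Counter()
    for known_area, observations in strata:
        tally = Counter(observations)              # step (1): tally points by use
        n_points = len(observations)
        for use, count in tally.items():
            # step (2): stratum area times proportion of points in this use
            totals[use] += known_area * count / n_points
    return dict(totals)                            # step (3): summed over strata

# Hypothetical data: two strata of 1,000 and 500 hectares.
strata = [
    (1000.0, ["maize"] * 30 + ["pasture"] * 10),   # 40 sample points
    (500.0, ["maize"] * 5 + ["forest"] * 20),      # 25 sample points
]
estimates = estimate_area_by_use(strata)
print(estimates)  # maize 850, pasture 250, forest 400 (hectares)
```

The estimates for all categories add up to the known total area of the universe (here 1,500 hectares), which is a useful check on the tally.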
14.3.2.2 Observations for a sample of lines.--A sample of lines is selected and the lines are
marked on maps or aerial photographs. As in the case of points, appropriate techniques of
stratification and clustering should be used to increase the efficiency of the design. The usual
procedure within ultimate sampling units is to select a sample of parallel lines spaced at equal
intervals.
By using aerial photographs, or by actually pacing the lines, the investigator determines the
proportion of each line falling into each land use category. Unbiased estimates are then made from
these observations by a procedure completely analogous to that described above for point samples.
A relatively cheap but biased form of line sampling involves the substitution of roads for a
probability sample of lines. The investigator drives a car along a prescribed route. The car is
equipped with a distance measuring device. As he drives, the investigator notes and records the
distance for which the road is bordered by each category of land being measured (specific crops, crop
land in general, pasture, woodland, etc.). Estimates are then made in the normal way for line
sampling.
This last technique is likely to be seriously biased, especially in areas where the road network is
sparse, since the pattern of land use along roads is likely to differ substantially from the overall
pattern for a given area. Techniques based on probability sampling should be used in preference if at
all possible.
14.3.3
Use of ratio estimation and double sampling to improve efficiency
Having completed area measurements on the holdings (or other units of analysis) in the sample, we
can estimate totals directly from these data by the estimation procedure which is appropriate to the
particular sample design. This procedure can usually be improved upon, however, if in addition to
making area measurements for a sample of the population, we also have available less accurate and
less expensive area data (for example, data obtained by direct interview) from the entire population.
Such data would normally come from a complete census. By means of ratio estimation, we can often
obtain estimates of population totals that will be more reliable than those that could be obtained from
either the objective measurements or the interview responses alone. The procedure is essentially the
same as that discussed in section 2.3 of chapter 10. The X-characteristic in this case would be the
actual measurement of the land obtained for a subset of the population; the Y-characteristic would be
the data collected by the interview.
Even more useful and practical is a technique called double sampling (see footnote 2), in which the less expensive
technique is used to obtain data from a relatively large sample of the population and the more
expensive technique to obtain data from a subsample of the basic sample. Again, ratio estimation is
used, but here the Y-characteristic is the response that is obtained by the less expensive technique,
and the sample estimate of the population total for the Y-characteristic is used in place of a total
based on 100-percent coverage.
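A minimal sketch of a double-sampling ratio estimate is given below. It assumes that the large sample is a simple random sample, so that the Y total can be estimated by simple expansion; the function name and all data values are hypothetical and are not taken from this manual.

```python
def double_sampling_ratio_total(N, y_large, pairs):
    """Ratio estimate of the total of X (objective measurement).

    N       : number of holdings in the population
    y_large : interview values (Y) for every holding in the large sample
    pairs   : (y, x) values for the subsample, where y is the interview
              response and x is the objective measurement
    """
    n1 = len(y_large)
    # Expansion estimate of the Y total from the large sample
    # (assumes simple random sampling).
    y_total_est = N * sum(y_large) / n1
    # Ratio of measured to reported area in the subsample.
    ratio = sum(x for _, x in pairs) / sum(y for y, _ in pairs)
    return ratio * y_total_est

# Hypothetical data: 50 holdings, 4 interviewed, 2 of those measured.
estimate = double_sampling_ratio_total(
    N=50,
    y_large=[10, 20, 30, 40],
    pairs=[(10, 11.0), (30, 33.0)],
)
print(round(estimate, 1))  # about 1375 hectares
```

If the interview data were available for the entire population, the known Y total would simply replace the expansion estimate in the last step.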
2. Double sampling is a statistical technique useful in a variety of situations whenever a characteristic of interest that is difficult
or expensive to determine is correlated highly with another characteristic that can be determined relatively easily or
inexpensively.
Compared with the method based on area measurement alone, methods using ratio estimation will be
preferred if the gain in efficiency more than offsets the cost of obtaining the supplementary
observations by the less expensive technique (either from the entire population or, in the case of
double sampling, from a larger sample from the population). The factors to be considered are:
(1)
The strength of the relationship between the data obtained by the two methods. The interview
response must have a high positive correlation with the area measurement if a significant
improvement is to be obtained. One would reasonably expect this to be the case.
(2)
The relative cost of the two methods. Assuming that the correlation is large enough, ratio
estimation will reduce the number of holdings requiring area measurement in order to achieve a
given level of reliability. Whether or not this reduction will offset the cost of obtaining the
interview responses depends in part upon the difference in costs between the two types of
observations.
Compared with the method based only on interview responses, the use of ratio estimation will be
preferred whenever it is believed that the bias in the interview responses is sufficient to justify the
additional expense of obtaining the area measurements. The concept of mean square error (MSE) is
needed to understand the situation more fully. Recall from previous chapters that the variance is
based on differences between estimates (x') based on samples and the value X that would be obtained
if data had been collected from all members of the population, using the same techniques. The mean
square error, on the other hand, is based on differences between estimates based on samples and the
true value of the quantity being measured (XT). If the data-collection technique is unbiased, X = XT,
then the MSE is equivalent to the variance; if the technique is biased, the MSE is equal to the
variance plus the square of the bias (X - XT), or
$$\mathrm{MSE}(x') = \mathrm{Var}(x') + (X - X_T)^2 \qquad (14.1)$$
For a given cost, data can be obtained by interview from a sample of a certain size. For the same
cost, data can be obtained by interview from a smaller sample, combined with objective
measurements from a subsample of this sample. Estimates based on the large interview sample will
have a specified MSE containing a bias component as well as a variance component. Ratio estimates
based on the combination of interview and objective measurement data will have a smaller bias but a
larger variance. The MSE may be either larger or smaller than the MSE based only on the large
interview sample depending on the variability in the population, the relative cost of the two
procedures (which determines the relative sample sizes), the relative size of the biases (or the
effectiveness of the ratio estimation procedure in reducing the bias), etc. The sampling statistician
must consider all of these factors in allocating the available resources between the two procedures.
His goal is to minimize the MSE for a given cost (or to minimize the cost of obtaining an acceptable
level of reliability).
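The trade-off can be illustrated with a small numerical sketch. All variances and biases below are hypothetical; they only show that, for the same cost, either design may have the smaller MSE.

```python
def mse(variance, bias):
    """Mean square error: variance plus the square of the bias."""
    return variance + bias ** 2

# Hypothetical case 1: interviews are strongly biased.
interview_only = mse(variance=4.0, bias=5.0)      # 4 + 25 = 29
with_measurement = mse(variance=9.0, bias=1.0)    # 9 + 1 = 10
print(interview_only, with_measurement)           # 29.0 10.0

# Hypothetical case 2: measurements are so costly that the subsample is
# very small and the variance of the combined estimate is large.
with_costly_measurement = mse(variance=30.0, bias=1.0)   # 30 + 1 = 31
print(with_costly_measurement > interview_only)          # True
```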
14.4
OBJECTIVE MEASUREMENT OF YIELD
The goal of objective measurement of yield is usually to estimate the yield of a crop on a unit basis
(such as bushels per acre, quintals per hectare, etc.). In order to estimate the total production, it is
necessary to have also an estimate of the total area of the crop in question planted. In some
instances, only the yield is estimated by objective means, although estimates of both the yield and the
area should be based on objective measurements.
The general procedure in making objective measurements of yield (usually called "crop cutting") is
to use a random process to select areas (usually called plots) planted, and to cut and weigh the
produce from each of these plots at or near the time the remainder of the field is harvested (see footnote 3). Each
different crop has different characteristics, and the same crop will behave differently in different
parts of the world. Consequently, there is no specific set of rules that can be applied to all crops or
even to the same crops in different locations. We will, however, discuss in general terms some of
the factors to be considered in planning such a program and describe some of the techniques that
have been used in the past.
14.4.1
Pilot studies
Because information gained about other crops or about the behavior of the crop in question in other
countries is not directly transferable to one's own situation, pilot studies should be carried out before
establishing any program for objective measurement of yield. Pilot studies can provide important
information about most of the things that need to be considered such as sampling variability,
optimum size and shape of plot, harvesting procedures, problems such as personnel and materials
needed to carry out the work, etc. They are also useful as training devices for those who will
eventually be in charge of the full-scale operation. On the basis of the pilot studies, the investigator
can develop a sampling plan and field procedures appropriate to the conditions under which the
survey will be conducted.
After a procedure has been decided upon, it is usually advisable to put it into operation only
gradually and, after it is in full operation, to carry it out for a few years simultaneously with the
procedure it is to replace. The existing program, no matter how inadequate it may be, should not be
ended until the proposed new method has been sufficiently tested and found to be clearly superior
and operationally feasible (see footnote 4). After its superiority and feasibility have been established, the new
method can then serve as a basis for evaluating the bias in the old method which would not be
possible unless the two were conducted simultaneously for a few years. This is particularly
important to users of the data who are interested in examining differences or trends over a period of
years; they must know to what extent observed differences in the data are simply the result of
differences in measurement technique.
14.4.2
Variability
One must have some idea of the variability in yield of the crop to be measured in order to plan
wisely. Two aspects of variability which are of interest are:
(1)
The relative variability of yields for different sizes and shapes of plots.
(2)
For a plot of given size and shape, the relative magnitude of the variation among fields and the
variation among plots within a field.
3. Objective measurements are also used to forecast yields on the basis of observations made earlier in the season.
Since the sampling procedures used in forecasting yields are quite similar to those used in estimating yields,
only the latter are discussed in this section.
4. Actually, it may be necessary to continue the existing program in any case, particularly if data are required for administrative
areas different from those for which estimates are made using objective data. Furthermore, the existing program may collect
data on a number of crops which are not economically important enough to justify an expensive objective measurement
program.
In deciding which type of plot to use, the investigator must balance the variability against the cost.
He will attempt to select the plot that will give the desired degree of reliability for the lowest cost,
although other factors (for example, personnel considerations) may force him to choose one that is
not quite the best in terms of costs and variances.
Experience has shown that in almost all cases, the variation among fields is considerably greater than
variation within fields. As a result, the number of plots selected within each sample field should be
small so that the available resources can be more efficiently expended on sampling as many different
fields as possible. In fact, in some investigations, the optimum number of plots has been only one
per field (see footnote 5). A minimum of two plots is necessary, of course, if one wishes to estimate the within-field
variability from the sample; nevertheless, the investigator may choose to have only one plot per field
if the within-field component of variance is very small compared with the between-field component.
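The reasoning above can be made concrete with the standard textbook result for two-stage sampling. The formula is not stated in this section; it is assumed here for illustration. With a cost of c1 per sample field and c2 per plot, and a variance of the form (between-field component)/n + (within-field component)/(n m) for n fields and m plots per field, the optimum m is the square root of (c1/c2) times (within-field component / between-field component). All numerical values below are hypothetical.

```python
import math

def optimum_plots_per_field(cost_per_field, cost_per_plot,
                            var_between, var_within):
    """Theoretical optimum number of plots per field for a two-stage
    design with a linear cost function (standard textbook result,
    finite population corrections ignored)."""
    return math.sqrt((cost_per_field / cost_per_plot)
                     * (var_within / var_between))

# Hypothetical values: reaching a field costs 4 times as much as cutting
# one plot, and the between-field component is 4 times the within-field one.
m_opt = optimum_plots_per_field(cost_per_field=4.0, cost_per_plot=1.0,
                                var_between=16.0, var_within=4.0)
print(m_opt)                      # 1.0
print(max(1, round(m_opt)))       # practical choice: 1 plot per field
```

When the within-field component is small relative to the between-field component, the result falls toward one plot per field unless the cost of reaching a field is very large relative to the cost of cutting a plot.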
14.4.3
Size and shape of plot
Circular, triangular, square, and rectangular plots have all been used in past studies for crops that are
scattered in the field or planted in very closely spaced rows (for example, small grains or hay). For
crops in widely spaced rows (for example, maize or cotton), rectangular plots are the logical choice;
the width is often designated in terms of rows and the length in terms of feet (or meters, etc.).
Along with the shape of the plot, a method of marking it must be specified. Rigid frames or other
devices have been used successfully for marking small plots. Ropes, chains, etc., are easier to
transport but are more difficult to place in the field if the worker has to measure and drive stakes at
the corners, etc. For a triangular plot, a closed chain with rings at the three vertices can be used quite
easily; the same device, provided it forms a right triangle, can also be used to mark rectangular plots
using a suitable combination of triangles. Large plots are usually laid out using pegs or stakes,
string, and a measuring tape.
As the size of the plot increases, the variability among plots decreases; however, since the within-field
contribution to the overall variance is usually negligible relative to the other sources of
variance, small plots are usually preferred from a practical standpoint. One man can usually do the
work alone, he can place a portable frame much faster than he can stake out a large plot, he can
harvest more quickly, and he has less material to handle.
Unfortunately, experience has shown that small plots almost always produce seriously biased
estimates. The reasons for this are not entirely clear, but it appears that two factors are largely
responsible:
(1)
In locating the plot in the field, it is much easier for the field worker to allow the condition of
the crop to influence the precise location of the smaller plot.
(2)
The problem of whether to count plants on the boundary as being in or out of the plot is more
critical with the smaller plot, since the perimeter of a small plot is greater relative to its area
than is the perimeter of a large plot. The general tendency appears to be to include plants that
should be excluded and, thus, to consistently overestimate the yield. For a smaller plot, even a
single plant erroneously included can seriously affect the results.
5. Theoretically, the optimum number of plots need not be an integer. As a practical matter, of course, the theoretical result
must be rounded to an integer.
14.4.4
Locating the plot in the field
Many different procedures have been proposed for locating plots in the field. Whatever method is
used, it is important that the field staff understand clearly how it should be done, and checks should
be made to see that they are following the instructions. Otherwise, subjective bias on the part of the
field worker will almost certainly enter into the procedure.
Ideally it would be desirable to divide the entire field into plots of the size and shape decided upon
and select the required number of plots at random. However, this is not usually practicable. A
method that has been used and is practicable whenever the field is rectangular (or can be
conveniently enclosed in a rectangle) is to locate points at random within the field; the sample plots
are then laid out in a prescribed manner about these points. For each plot to be located, the
procedure is as follows:
(1)
The field worker selects a random number x between 0 and n1, where n1 represents the total
length of one dimension of the field (or of the enclosing rectangle); he selects another random
number y between 0 and n2, where n2 represents the total length of the other dimension. For a
row crop, the first dimension would usually be expressed in terms of the number of rows (see footnote 6). In
other cases, the dimensions would be expressed in terms of units, such as meters, or in terms of
steps or paces.
(2)
Starting at a predetermined corner, the field worker measures or paces (or counts rows) the
distance x along the appropriate side of the field (or of the enclosing rectangle); then at right
angles to this side, he measures or paces the distance y into the field.
(3)
If the worker is still within the boundaries of the field, he marks the random point (for example,
by digging with his heel and driving a stake). If he is not within the boundaries of the field (he
would, of course, be within the enclosing rectangle), he uses another pair of random numbers
and repeats the process.
(4)
From this point, the field worker lays out the plot. If the plot is to be circular, the random point
should be used as the center. If it is to be triangular or rectangular, the point should be used to
locate a predetermined vertex or corner; this vertex or corner is usually chosen so that the plot
will extend away from the random point in the direction that the worker has been walking.
Figure 2 on the following page illustrates this procedure. In this example the point (x1, y1) falls
inside the field and is accepted. The point (x2, y2) falls outside the field and is rejected. From the
sample point, the plot would usually extend upward and to the right.
6. The random number would then be selected between 1 and the total number of rows in the field (n1).
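Steps (1) to (3) of the point-location procedure amount to drawing random points in the enclosing rectangle and rejecting those that fall outside the field. A minimal sketch follows; the triangular field, its dimensions, and the function names are hypothetical, and in practice the test of whether a point is inside the field is made by the worker on the ground.

```python
import random

def locate_random_point(n1, n2, inside_field, rng=random, max_tries=10_000):
    """Draw (x, y) uniformly in the n1-by-n2 enclosing rectangle and
    repeat with a new pair of random numbers until the point falls
    inside the field."""
    for _ in range(max_tries):
        x = rng.uniform(0, n1)   # distance along the first side
        y = rng.uniform(0, n2)   # distance at right angles into the field
        if inside_field(x, y):
            return x, y
    raise RuntimeError("no point found inside the field")

# Hypothetical field: a right triangle inside a 100 m by 60 m rectangle.
def in_triangle(x, y):
    return y <= 60 * (1 - x / 100)

point = locate_random_point(100, 60, in_triangle, rng=random.Random(1))
print(point)  # a point that satisfies in_triangle
```

Because rejected points are replaced by fresh pairs of random numbers, every location inside the field has the same chance of being selected, whatever the shape of the field.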
One difficulty in this scheme is that it allows plots to overlap field boundaries; any of the several
feasible rules that can be used in such cases present certain problems. Consider, for example, a field
of maize 200 rows wide and 100 meters long. Suppose that the plot is to be 4 rows wide by 6 meters
long. Suppose further that the selected row coordinate is 198 and the length coordinate is 95. From
the point of intersection of the coordinates, the plot would extend 1 meter and 1 row beyond the
boundaries of the field (the plot starts at the end of the 95th meter but includes row 198). Possible
rules that could be adopted to take care of this situation include:
(1)
Instruct the worker to harvest only the partial plot 3 rows by 5 meters and, of course, to record
these dimensions on his form. Using the proper inflation factor, an unbiased estimate of the
yield for this field could be made. In this example, this procedure could be carried out rather
easily; however, if the field were irregular in shape or the plot were circular or triangular, the
worker might find it difficult to estimate the portion of the plot in the field.
(2)
Instruct the worker to think of the rows as being numbered in a circular manner and similarly
the length. Thus, in this example, row 1 would be the fourth row of the plot and the first meter
in each row would be taken to finish out the length of the plot. This, too, would be an unbiased
procedure. It would, however, not be practicable for anything except rectangular plots in
regularly shaped fields. Furthermore, it might be difficult to explain it to the average field
worker. Finally it does not fit into the usual concept of a plot as a contiguous piece of land.
(3)
Instruct the worker to restrict his random selection to numbers that will not allow this situation
or, equivalently, to reject plots found to overlap boundaries and select another set of
coordinates. In this case, in the example, he could do the former by restricting the selection for
rows to numbers between 1 and 197 and for length to numbers between 0 and 94. This
procedure is clearly biased since the edges of the field (in the example, the first and last four
rows and the first and last six meters) have less chance of being in the sample than does the
remainder of the field. If the yield tends to be greater or smaller than average around the edges
of the field, estimates of yield based on this method will be biased. However, this is the
simplest procedure. If the borders of the field are small in area relative to the remainder of the
field or if there is no reason to believe that the yield is different along the edges, this method
can be recommended in preference to unbiased but more difficult procedures.
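Rules (1) and (2) can be checked against the maize example with a short sketch. The function names are illustrative; the field and plot dimensions are those of the example (200 rows by 100 meters, plot of 4 rows by 6 meters, selected coordinates 198 and 95).

```python
def partial_plot(row_start, length_start, plot_rows, plot_length,
                 field_rows, field_length):
    """Rule (1): dimensions of the part of the plot lying inside the field.

    The row coordinate is the first row of the plot; the length
    coordinate is the distance at which the plot begins.
    """
    rows_inside = min(row_start + plot_rows - 1, field_rows) - row_start + 1
    length_inside = min(length_start + plot_length, field_length) - length_start
    return rows_inside, length_inside

def wrapped_rows(row_start, plot_rows, field_rows):
    """Rule (2): rows numbered circularly, so row 1 follows the last row."""
    return [(row_start - 1 + k) % field_rows + 1 for k in range(plot_rows)]

print(partial_plot(198, 95, 4, 6, 200, 100))   # (3, 5): 3 rows by 5 meters
print(wrapped_rows(198, 4, 200))               # [198, 199, 200, 1]
```

Under rule (1) the harvested weight would be related to the area of the partial plot actually cut (here 3 rows by 5 meters) rather than to the nominal plot size.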
14.4.5
Harvesting procedure
If the plots are small, the field worker will probably do the work himself, cutting the crop and
weighing it in the field. He will then take a small subsample to be sent to the central office for
drying. (It is always a good practice to return the remainder of the produce to the holder.) If plots
are large enough, it may be desirable to harvest them by the same method that the holder will use in
the regular harvest and, if possible, at the same time. This will require his cooperation and help.
14.4.6
Adjustment to actual production
The technician's method of harvesting small plots and processing the produce usually gives a higher
rate of yield than do the normal harvesting procedures used by the holder because of greater
harvesting losses in the normal methods. For some crops, these losses are substantial. In addition, it
is not possible to harvest all plots on or immediately before the harvest date. If the worker waits too
long to start harvesting, he will almost certainly find some fields harvested before he arrives;
consequently, he will need to start harvesting plots in some fields while the crop is immature. Both
of these factors will cause biased estimates if adjustments are not made.
(The harvesting of small plots measures what is often referred to as biological yield.)
One method of adjustment is to select a subsample of fields of known area and harvest them for the
holders, using the normal procedures. This provides a basis for adjusting the data collected from the
harvested plots. A similar method appropriate for some crops (for example, hay crops that are taken
from the field in the form of bales) is to arrange to weigh the entire crop in a subsample of fields as
the holder transports it from the harvested field, but allowing the holder to harvest it whenever and
however he wishes.
Another method of adjustment is to carry out a gleaning operation after harvest to estimate field
losses directly. The estimated field losses per unit area are then subtracted from the estimated
biological yield to get the actual yield. This procedure has the advantage of not requiring the worker
to be present at the harvest--an important consideration since several holders of different sample
fields may all decide to harvest on the same day. Unfortunately, experience has shown that the
problems of estimating field losses are fully as great as those of estimating the original biological
production.
As already mentioned, it is desirable that sample plots be harvested as near as possible to the date the
remainder of the field is harvested; however, this cannot always be accomplished for all fields. One
object of a pilot study would be to determine what adjustments, if any, must be made for differences
between these harvesting dates. For many crops, no adjustment is necessary because the crop has
essentially completed its growth before either date and is then in the process only of losing moisture.
An additional adjustment that must be made is for moisture content. A procedure commonly used is
to dry the material from the plots (or a subsample of it) until it is at or very near to 0% moisture
content and then to weigh it. This so-called dry weight can then be adjusted to any moisture content
desired. For many crops, a standard moisture content has been specified. If the dry material is only a
subsample of the plot, a two-step process is required. The material from the entire plot and the
subsample must be weighed separately in the field immediately after cutting. The subsample is then
dried and weighed. The dry weight of the entire plot can then be estimated using the ratio of dry to
wet weight of the subsample.
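The two-step moisture adjustment can be sketched as follows. The weights are hypothetical, and the sketch assumes that moisture content is expressed as a fraction of total (wet) weight, so that a weight at moisture content m equals the dry weight divided by (1 - m).

```python
def plot_weight_at_moisture(plot_wet, sub_wet, sub_dry, target_moisture):
    """Estimate the plot weight at a standard moisture content.

    plot_wet        : field weight of the entire plot after cutting
    sub_wet         : field weight of the subsample after cutting
    sub_dry         : weight of the subsample after drying
    target_moisture : standard moisture content as a fraction of total
                      weight (wet basis, an assumption of this sketch)
    """
    dry_fraction = sub_dry / sub_wet      # ratio of dry to wet weight
    plot_dry = plot_wet * dry_fraction    # estimated dry weight of the plot
    return plot_dry / (1.0 - target_moisture)

# Hypothetical weights in kilograms, adjusted to 14 percent moisture.
weight = plot_weight_at_moisture(plot_wet=20.0, sub_wet=1.0,
                                 sub_dry=0.8, target_moisture=0.14)
print(round(weight, 1))  # about 18.6 kg
```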
14.4.7
Operational considerations
Before an extensive program to measure yields objectively can be put into operation, numerous
practical problems must be solved. These include the availability of labor, the availability of
facilities for drying the crops, equipment needs, the need to coordinate the activities of the workers
with the holders' plans for harvesting their crops, etc. The problem of timing can be very difficult,
particularly when the crop is likely to be ready for harvest at the same time over a wide area. As
stated previously, one important reason for conducting pilot studies is to obtain information about
these practical problems.
Study Assignment
Problem A.
The sketch below simulates a segment outlined on an aerial photo.
The segment contains a total of 100 hectares divided into categories according to the uses made
of the land. The categories are:
Crop land:
A1 - maize
A2 - wheat
A3 - other crop land
B - grassland
C - forest
D - wasteland
A grid of 36 dots has been placed over the segment to be used in estimating the amount of land
by categories of use.
Exercise 1.
Estimate the number of hectares in this segment that are used for crop land.
Exercise 2.
Estimate the number of hectares for grassland.
Exercise 3.
Estimate the number of hectares in forest and wasteland.
Exercise 4.
Estimate the proportion of crop land used for maize. In what basic way does this
estimate differ from those in exercises 1 to 3?
Problem B.
In the sketch above, marks on the east and west boundaries of the segment subdivide the boundaries into
40 units. Using these marks as guides, place two lines at random across the segment parallel to the north
and south boundaries.
Exercise 5.
Use these parallel lines to estimate the quantities estimated in Problem A.
Exercise 6.
For each quantity, compile the distribution of the estimates obtained by several trials or several persons.
Problem C.
The sketch below shows a field bordering on a river.
Exercise 7. Draw a circle around the corner corresponding to the unique point according to each
of the definitions given below. Place the appropriate letter (a, b, c) by each circle.
(a) Northwest corner - Identify those boundary points lying farthest north. The northwest
corner is the most western of these points.
(b) Northwest corner - Identify those boundary points lying farthest west. The northwest
corner is the most northern of these points.
(c) Southwest corner - Identify those boundary points lying farthest south. The southwest
corner is the most western of these points.
Problem D.
Data on the total area of crop land harvested have been obtained by interview from a simple random sample
(selected without replacement) of 24 holdings out of a population of 96 holdings. Objective measurements
have been carried out on a subsample of 8 of these holdings selected at random without replacement. The
data are shown in the table below.
Hectares of crop land harvested

Unit    Interview (Y)    Objective measurement (X)
  1          14                  14.4
  2          79                    -
  3          46                    -
  4         112                 116.1
  5          46                    -
  6          92                    -
  7          29                    -
  8          40                  41.9
  9          12                    -
 10          78                  80.4
 11          66                    -
 12          43                    -
 13          39                    -
 14          91                  93.9
 15          17                  16.8
 16          68                    -
 17         100                    -
 18          87                    -
 19          74                  75.4
 20          64                    -
 21          78                    -
 22          40                  42.6
 23          22                    -
 24          55                    -
Exercise 8.
Estimate the total crop land harvested using the interview data only. Estimate the variance of this
estimated total.
Exercise 9.
Estimate the total crop land harvested using the objective measurement data only. Estimate the variance of
this estimate.
Exercise 10.
Using the formulas given below, estimate the total crop land harvested and the variance of this estimate
using both types of data and ratio estimation.
where
n1 = size of large interview sample
n2 = size of objective measurement subsample
SELECTED LIST OF REFERENCES
1. Cochran, William G. Sampling Techniques. Second edition. New York, John Wiley and
Sons, 1963.
2. Food and Agriculture Organization of the United Nations (FAO). Estimation of Areas in
Agricultural Statistics. Edited by S. S. Zarkovich. Rome, 1965.
3. Food and Agriculture Organization of the United Nations (FAO). Estimation of Crop Yields.
By V. G. Panse. Rome, 1954.
4. Food and Agriculture Organization of the United Nations (FAO). Sampling Methods and
Censuses. By S. S. Zarkovich. Rome, 1965. Quality of Statistical Data. By S. S. Zarkovich.
Rome, 1966.
5. Hansen, Morris H.; Hurwitz, William N.; and Madow, William G. Sample Survey Methods
and Theory. New York, John Wiley and Sons, 1953. (Volume I: Methods and
Applications; Volume II: Theory)
6. Kish, Leslie. Survey Sampling. New York, John Wiley and Sons, 1965.
7. Kniceley, Maurice R. Probability Sampling for Surveys and Censuses, Course Notes,
PSDP, 1985.
8. Megill, David J. Preliminary Recommendations for Designing the Master Frame for the
Senegal Intercensal Household Survey Program, U.S. Bureau of the Census, November
1990.
9. Neter, John and Wasserman, William. Fundamental Statistics for Business and
Economics. Boston, Mass., U.S.A., Allyn and Bacon, 1961.
10. Sampford, M. R. An Introduction to Sampling Theory. Edinburgh and London, Oliver and
Boyd, 1962.
11. Sukhatme, Pandurang V. Sampling Theory of Surveys with Applications. Ames, Iowa,
U.S.A., The Iowa State College Press, 1953. New Delhi, India, The Indian Society of
Agricultural Statistics, 1953.
12. The RAND Corporation. A Million Random Digits. Glencoe, Illinois, U.S.A., The Free
Press, 1955.
13. United Nations. Statistical Office. Handbook of Household Surveys: A Practical Guide for
Inquiries on Levels of Living. New York, 1964. (Studies in Methods, Series F, No. 10)
14. U.S. Bureau of the Census. The Current Population Survey Reinterview Program, Some
Notes and Discussion. Washington, D.C., U.S. Government Printing Office, 1963.
(Technical Paper No. 6)
15. U.S. Bureau of the Census. The Current Population Survey--A Report on Methodology.
Washington, D.C., U.S. Government Printing Office, 1963. (Technical Paper No. 7)
16. U.S. Department of Commerce. Statistical Abstract, Washington, D.C., U.S. Government
Printing Office, 1981, Table 202, p. 123.
17. Yates, Frank. Sampling Methods for Censuses and Surveys. Third edition. New York,
Hafner Publishing Company, 1960.
Annex A
GLOSSARY OF TERMS
Accuracy: Quality of survey result as measured by the closeness of the survey estimate to the
exact or true value being estimated. The accuracy is affected by both sampling error and bias.
Allocation of sample: The method used in determining how the sample should be distributed.
In stratified, cluster sampling, it usually refers to the number of clusters to be allocated to each
stratum and the size of sample selected from each cluster.
Area sample: A type of sample (usually a multistage sample) in which the sampling units are
individual land areas (segments) which can be defined on a map. The segments cover the entire
area to be included in the survey; the segments do not overlap; and, in most applications, the
boundaries of each segment must be clearly defined so they can be recognized and identified by
enumerators in the field. Often the segments are clusters of the units of analysis; for example,
clusters of farms or housing units. Each unit of analysis must be associated with one and only
one segment.
Attribute: See also ‘Characteristic.’ Quality or characteristic. This term is also used in reference
to the proportion of units having a certain characteristic.
Benchmark statistics: Statistics that provide information against which one can measure or
compare changes.
Bias: The difference between the expected value of an estimator and the true population value
being estimated. When the bias is equal to zero, the estimator is said to be “unbiased.”
The term bias is also generally used to designate an effect which deprives a statistical result
of representativeness by systematically distorting it, as distinct from a random error which may
distort on any one occasion but balances out on the average.
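The definition of bias can be checked directly on a very small population by listing every possible sample. The sketch below is only an illustration: the population values and the function name `bias` are hypothetical, and samples are assumed to be simple random samples of size 2 selected without replacement.

```python
from itertools import combinations
from statistics import mean

def bias(population, n, estimator, true_value):
    """Expected value of the estimator over all samples of size n, minus the true value."""
    estimates = [estimator(sample) for sample in combinations(population, n)]
    return mean(estimates) - true_value

population = [2, 4, 6, 12]      # hypothetical population of N = 4 values
true_mean = mean(population)    # the value being estimated

bias_of_mean = bias(population, 2, mean, true_mean)  # sample mean as estimator
bias_of_max = bias(population, 2, max, true_mean)    # sample maximum as estimator
print(bias_of_mean, bias_of_max)
```

The sample mean averages out to the population mean over all six possible samples, so its bias is zero; the sample maximum overstates the mean on average, so it is a biased estimator of the mean.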
Bounded recall: An interview where the respondent is reminded of what he reported in an
earlier interview and is then asked only to report on any new events that occurred subsequent to
the bounding interview. This method is usually used in income and expenditure surveys, in
which at the beginning of the bounded interview (the second and subsequent interviews), the
respondent is told about the expenditures reported during the previous interview, and is then
asked about additional expenditures made since then.
Bounding: Prevention of erroneous shifts of the timing of events by having the enumerator or
respondent supply at the start of the interview (or in a mail survey) a record of events reported in
the previous interview.
Census: Data collection program through which attempts are made to collect information about
every element (person, household, farm, etc.) in the population.
Characteristic: A variable having different possible values for different individual units of
sampling or analysis. In a sample survey, we observe or measure the values of one or more
characteristics for the units in the sample. For example, we observe (or ask about) the area of
land in rice, or the number of cattle on a farm.
Classification Errors: Errors caused by conceptual problems and misinterpretations in the
application of classification systems to survey data.
Cluster sample: A system of sampling in which the units of analysis of the population are
considered as grouped into clusters, and a sample of clusters is selected. The selected clusters
then determine the units to be included in the sample. The sample may include all units in the
selected clusters or a subsample of units in each selected cluster.
Clusters: See also ‘Cluster sample.’ Small groups into which a population is divided to
facilitate the data collection. The groups generally are defined so as to help break a large survey
area into workload-sized chunks and/or to reduce travel and administrative costs. Ideally, the
units in a cluster should be as heterogeneous as possible.
Coding: Coding is a technical procedure for converting verbal information into numbers or
other symbols which can be more easily counted and tabulated.
Coding error: Error that occurs during the coding of sample data. The assignment of an
incorrect code to a survey response.
Coefficient of variation: The relative standard error; that is, the standard error as a proportion of
the magnitude of the estimate. The population coefficient of variation is denoted by CV, which is
estimated from a sample by cv. The coefficient of variation of an estimate, such as the mean,
proportion, or total, is denoted by CV( ), with the estimate of interest placed inside the
parentheses. If $\hat{\theta}$ is an estimate of the population parameter $\theta$, then
$CV(\hat{\theta})$ denotes the true coefficient of variation of the estimate and $cv(\hat{\theta})$
is an estimate of $CV(\hat{\theta})$.
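As an illustration of the coefficient of variation of an estimate, the following sketch computes the estimated cv of a sample mean. It assumes a simple random sample selected without replacement from a population of N units; the function name `cv_of_mean` and the data are hypothetical.

```python
import math
from statistics import mean, variance

def cv_of_mean(sample, N):
    """Estimated coefficient of variation of the sample mean under simple random sampling."""
    n = len(sample)
    s2 = variance(sample)                          # sample variance, divisor n - 1
    standard_error = math.sqrt((1 - n / N) * s2 / n)
    return standard_error / mean(sample)           # standard error relative to the estimate

cv = cv_of_mean([10, 12, 14, 16], N=100)           # hypothetical sample of n = 4
print(round(100 * cv, 1), "percent")
```

Because the cv is a ratio, it has no units, which makes it convenient for comparing the reliability of estimates measured on different scales.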
Conditioning effect: The effect on responses resulting from the previous collection of data
from the same respondents in recurring surveys.
Confidence interval: A range above and below the estimated value which may be expected to
enclose the true value with a known probability, assuming no bias.
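A confidence interval is usually formed as the estimate plus or minus a multiple of its standard error. The sketch below assumes that the estimate is unbiased and approximately normally distributed; the function name and the numbers are hypothetical.

```python
from statistics import NormalDist

def confidence_interval(estimate, standard_error, level=0.95):
    """Return (lower, upper) limits, assuming a normal sampling distribution and no bias."""
    z = NormalDist().inv_cdf(0.5 + level / 2)   # about 1.96 for a 95 percent interval
    half_width = z * standard_error
    return estimate - half_width, estimate + half_width

lower, upper = confidence_interval(estimate=50.0, standard_error=2.0)
print(round(lower, 2), round(upper, 2))
```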
Consistent estimate: An estimate of a type that (while possibly biased) approaches more
and more closely the true value being estimated as the size of sample increases, the most
common example being a ratio estimate.
Content error: Error of observation or objective measurement, of recording, of imputation, or of
other processing which results in associating a wrong value of the characteristic with a specified
unit.
Coverage error: The error in an estimate that results from (1) failure to include in the frame all
units belonging to the defined population, or failure to include specified units in the conduct of
the survey (undercoverage), and (2) inclusion of some units erroneously, either because of a
defective frame or because of inclusion of unspecified units or inclusion of specified units more
than once in the actual survey (overcoverage).
Cost function: A mathematical expression showing the cost of conducting a survey in terms of
the sample sizes and unit costs.
Editing: Preliminary step in which the responses are inspected, corrected and sometimes
precoded according to a fixed set of rules.
Efficiency: A comparative measure of one sample design relative to another with respect to
amount of precision produced per unit of cost for a given sample size.
Element: See Unit of analysis.
Elementary Unit: See Unit of analysis.
Estimate: A numerical quantity calculated from sample data and intended to provide
information about an unknown population value.
Estimating formula: A mathematical formula used to calculate an estimate.
Estimator: See Estimating formula.
Expected value: The average value of the sample estimates over all possible samples.
Finite Population Correction Factor (fpc): A factor applied to the variance of an estimate to
account for sampling without replacement from a finite population. It matters when the sample
size is large with respect to the size of the population.
Frame: A list of units which make up a population. The frame consists of previously available
descriptions of the objects or material related to the physical field in the form of maps, lists,
directories, etc., from which sampling units may be constructed and a set of sampling units
selected; and also information on communications, transport, etc., which may be of value in
improving the design for the choice of sampling units, and in the formation of strata.
Imputation: The process of developing estimates for missing or inconsistent data in a survey.
Data obtained from other units in the survey are usually used in developing the estimate.
Independent information: Data known in advance or simultaneously with the survey, which are
not based on the survey but may be used to improve the survey design. Such data may be used
for stratifying, deciding on the probabilities of selection, or estimating the final results from the
sample data.
Interviewer bias: Bias in the responses which is the direct result of the action of the interviewer.
Interviewer error: Errors in the responses obtained in a survey that are due to actions of the
interviewer.
Interviewer variance: The component of the nonsampling variance which is due to the
different ways in which different interviewers elicit or record responses.
Intracluster or intraclass correlation: A measure of the degree of homogeneity (or heterogeneity)
among the elementary units within a cluster. It can be used to judge how satisfactorily clusters
have been formed. For example, the closer the value is to zero (or the more negative it is), the
more unlike the elementary units within a cluster are and, consequently, the better the clusters
have been formed. The same measure can be used to evaluate how effectively strata have been
created; in that case, in contrast, a high value is desirable.
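One way to compute an intracluster correlation is sketched below. It is an illustration only: it assumes clusters of equal size M (at least 2) and uses the pairwise definition, in which the products of deviations for all pairs of different units in the same cluster are compared with the total sum of squared deviations. The function name and the data are hypothetical.

```python
def intracluster_correlation(clusters):
    """Pairwise intracluster correlation for equal-sized clusters (size M >= 2)."""
    M = len(clusters[0])
    values = [y for cluster in clusters for y in cluster]
    grand_mean = sum(values) / len(values)
    total_ss = sum((y - grand_mean) ** 2 for y in values)
    cross_products = 0.0
    for cluster in clusters:
        d = [y - grand_mean for y in cluster]
        # Sum of d_j * d_k over all pairs j != k within the cluster
        cross_products += sum(d) ** 2 - sum(x * x for x in d)
    return cross_products / ((M - 1) * total_ss)

homogeneous = intracluster_correlation([[1, 1], [5, 5]])     # units alike within clusters
heterogeneous = intracluster_correlation([[1, 5], [1, 5]])   # units unlike within clusters
print(homogeneous, heterogeneous)
```

With this definition the value ranges from -1/(M - 1), when each cluster is as mixed as possible, to +1, when all units in a cluster are identical.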
Item nonresponse: The type of nonresponse in which some questions, but not all, are answered
for a particular unit; that is, one or more questions are missed for an interviewed unit.
List: A population in which the sampling units have been numbered or otherwise identified; the
list of units can be the basis for the selection of a sample. See also Sampling Frame.
Mean square error: A measure of the accuracy of an estimate or the extent to which an estimate
from sample data differs from the true population value being estimated. If the estimates are
unbiased, the mean square error is equivalent to the variance.
Multiframe sampling: The use of two or more sampling frames to select a survey sample.
Generally necessary when the usual frame, such as an address register, will not adequately cover
the population and/or there are unique or unusually large units that must appear in the sample.
Multistage sampling: The most common type of cluster sampling. In this method, a sample of
clusters is selected, and then a subsample of units is selected within each sample cluster. If the
subsample of units is the last stage of sample selection, it is called a two-stage sample design
(although each such unit may contain more than one unit of analysis, as in an area sample). If the
subsample is also a cluster from which units are again selected, it is a three-stage design, or
four-stage design, etc.
Noninterview: The type of nonresponse in which no information is available from occupied
sample units for such reasons as: not at home, refusals, incapacity and lost questionnaires.
Noninterview adjustment: A method of adjusting the weights for interviewed units in a survey
to the extent needed to account for occupied sample units for which no information was
obtained.
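A simple form of noninterview adjustment is sketched below, assuming that all units shown belong to one adjustment cell and that every unit is an occupied sample unit. The weights of the interviewed units are inflated so that they also represent the noninterviewed units; the function name and the weights are hypothetical.

```python
def noninterview_adjusted_weights(weights, interviewed):
    """Inflate the weights of interviewed units to account for noninterviews in the same cell."""
    total_weight = sum(weights)
    interviewed_weight = sum(w for w, done in zip(weights, interviewed) if done)
    factor = total_weight / interviewed_weight     # noninterview adjustment factor
    return [w * factor if done else 0.0 for w, done in zip(weights, interviewed)]

weights = [10, 10, 10, 10]                 # hypothetical base weights
interviewed = [True, True, True, False]    # the fourth unit is a noninterview
adjusted = noninterview_adjusted_weights(weights, interviewed)
print(adjusted)
```

The sum of the adjusted weights equals the sum of the original weights, so estimated totals are not reduced by the noninterviews.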
Nonsampling error: The error in an estimate arising at any stage in a survey from such sources
as varying interpretation of questions by enumerators, unwillingness or inability of respondents
to give correct answers, nonresponse, improper coverage, and other sources exclusive of
sampling error. This definition includes all components of the Mean Square Error (MSE) except
sampling variance.
Optimum allocation of sample: Refers to the selection of a sample in such a way as to produce
the minimum standard error for a constant sample size or for a constant cost. It is used in both
stratified sampling and cluster sampling.
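For stratified sampling with a fixed total sample size and equal unit costs in every stratum, a well-known form of optimum allocation (often called Neyman allocation) makes the sample in each stratum proportional to the stratum size multiplied by the stratum standard deviation. The sketch below assumes those conditions; the function name and figures are hypothetical.

```python
def neyman_allocation(n, stratum_sizes, stratum_sds):
    """Allocate a total sample of n to strata in proportion to N_h * S_h."""
    products = [N_h * S_h for N_h, S_h in zip(stratum_sizes, stratum_sds)]
    total = sum(products)
    return [n * p / total for p in products]

# Two hypothetical strata of equal size; the second is three times as variable
allocation = neyman_allocation(n=100, stratum_sizes=[1000, 1000], stratum_sds=[10, 30])
print(allocation)
```

In practice the results are rounded to whole numbers, and at least two units are kept in each stratum so that variances can be calculated.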
Overhead costs: Costs that are fixed and do not vary with the size of the sample. These do not
enter into designing the sample. Included are such costs as administration, rent, equipment,
printing, and utilities.
Parameters: These are values descriptive of the population distribution and calculated from all
population units. They are estimated from a sample, the estimates being called statistics. For
normal distributions the parameters are the mean and standard deviation.
Population: Any clearly defined set of units (or elements) for which estimates are to be made.
The elements can be persons, farms, households, blocks, counties, businesses, and so on. Most of
our discussion deals with sampling from a finite population, containing a finite number of
elements.
Precision: Difference between the sample estimate and a complete count value collected under
the same conditions. This is measured by the sampling error or relative sampling error.
Primary sampling unit (PSU): The units making up the sampling frame for the first stage of a
multistage sample.
Probability of selection: The chance each unit has of being selected in the sample. This is
known prior to sample selection.
Probability proportionate to size (PPS): A method of sample selection in which units are
selected with unequal probability of selection, the probability for each unit being proportionate
to a measure of size. The measure of size for a unit is a number assigned to that unit in advance
of selection, which is believed to be highly correlated with the statistics to be estimated.
Probability proportionate to size is frequently abbreviated to PPS.
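One common way to carry out PPS selection is systematic selection from the cumulated measures of size. The glossary entry does not prescribe a particular procedure, so the sketch below is only an assumption of how it might be done; the function name, the measures of size, and the starting point are hypothetical.

```python
def pps_systematic(sizes, n, start):
    """Select n units with probability proportionate to size.

    `start` is a random number in [0, interval), where interval = total size / n.
    Returns the positions (0-based) of the selected units.
    """
    interval = sum(sizes) / n
    selected = []
    cumulated = 0.0
    unit = -1
    for k in range(n):
        point = start + k * interval
        # Move along the list until the cumulated size passes the selection point
        while cumulated <= point:
            unit += 1
            cumulated += sizes[unit]
        selected.append(unit)
    return selected

sizes = [10, 20, 30, 40]                       # hypothetical measures of size
selected = pps_systematic(sizes, n=2, start=15)
probabilities = [2 * s / sum(sizes) for s in sizes]   # n * size / total
print(selected, probabilities)
```

A unit whose measure of size exceeds the selection interval can be selected more than once with this procedure; such units are usually set aside and included in the sample with certainty.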
Proportion: Measure of the relative frequency of units that possess a certain characteristic in the
population or sample.
Proportionate stratified sampling: A system of selecting a stratified sample in which the same
probability of selection is used in each stratum.
Reliability: The confidence that can be assigned to a conclusion of a probabilistic nature.
Response bias: The difference between the average of the averages of the responses over a large
number of independent repetitions of the census and the unknown average that could be
measured if the census were accomplished under ideal conditions and without error; that is, the
difference between the average reported value over trials and the true value. It is a combined
bias: the algebraic sum of all bias terms representing the diverse sources of bias.
Response error: The part of the nonsampling error which is due to the failure of the respondent
to report the correct value (respondent error) or the interviewer to record the value correctly
(interviewer error). It includes both the consistent response biases and the variable errors of
response which tend to balance out.
Response variance: That part of the response error which tends to balance out over repeated
trials or over a large number of interviewers. The variance among the trial means over a large
number of trials. The response variance of a survey estimator is the sum of the simple response
variance and the correlated response variance.
Response variance, correlated: The correlated response variance is the contribution to the total
variance arising from nonzero correlations (in the sense of the distribution of measurement
errors) among the response of sample units. The contribution to the total response variance from
the correlations among response deviations.
Response variance, uncorrelated: The simple response variance is the contribution to the total
variance arising from the variability of each survey response about its own expected value. In
terms of a simple random sampling design, the simple response variance is the population mean
of the response variances of the individual population units; that is, the basic trial-to-trial
variability in response, averaged over the elements in the population.
Rotation bias: A type of bias that occurs in panel surveys which consist of repeated interviews
on the same units. Although these surveys are designed so that the estimates of a characteristic
are expected to be nearly the same for each panel in the survey, this expectation is often not
realized. For example, an estimate from a panel that is in the survey for the first time may differ
significantly from estimates from the panels that have been in the survey longer. One common
form is a downward tendency in the reported value of a characteristic when observation of the
same units is continued over a longer period of time. For example, it was found in expenditure
surveys that the average expenditure per item per person is usually higher in the first week of
the survey than in the second or the third.
Sample: A subset of a population. As used in these chapters, it always refers to a probability
sample; that is, a sample in which each element in the population has a known probability of
selection.
Sample design: The sampling plan and estimation procedures.
Sample Survey: A data collection program through which information is collected from a
probability-selected subset of the population.
Sampling bias: That part of the difference between the expected value of the sample estimator
and the true value of the characteristic which results from the sampling procedure, the estimating
procedure, or their combination.
Sampling Distribution: The distribution of values of a statistic calculated from all possible
samples of the same size from the same population.
Sampling Error (of Estimator): That part of the error of an estimator which is due to the fact
that the estimator is obtained from a sample rather than a 100 percent enumeration using the
same procedures. The sampling error has an expected frequency distribution for repeated
samples, and the sampling error is described by stating a multiple of the standard deviation of
this distribution. That part of the difference between a population value and an estimator
thereof, derived from a random sample, which is due to the fact that only a sample of values is
observed; as distinct from errors due to imperfect selection, bias in response or estimation, errors
of observation and recording, etc. The totality of sampling errors in all possible samples of the
same size generates the sampling distribution of the statistic which is being used to estimate the
parent value.
Sampling frame: The totality of sampling units from which a sample is to be selected. The
frame may be, for example, a listing of persons or housing units, or a file of records.
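Sampling Variance: The variance of an estimator. It is denoted by $\mathrm{Var}(\hat{\theta})$,
where $\hat{\theta}$ denotes any estimator. For a simple random sample of $n$ units selected
without replacement from a population of $N$ units, the variance of the sample mean $\bar{y}$ is
given by:
$$\mathrm{Var}(\bar{y}) = \left(1 - \frac{n}{N}\right)\frac{S^2}{n}$$
where $S^2$ is the population variance computed with divisor $N - 1$.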
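The sampling variance of the mean can be verified on a small population by listing all possible samples. The sketch below compares the variance of the sample means over every simple random sample (without replacement) with the usual formula (1 - n/N) S²/n, where S² is the population variance with divisor N - 1. The population values and function names are hypothetical.

```python
from itertools import combinations
from statistics import mean, pvariance, variance

def sampling_variance_by_enumeration(population, n):
    """Variance of the sample mean over all possible samples of size n."""
    sample_means = [mean(sample) for sample in combinations(population, n)]
    return pvariance(sample_means)

def sampling_variance_by_formula(population, n):
    """(1 - n/N) * S^2 / n, with S^2 computed using divisor N - 1."""
    N = len(population)
    return (1 - n / N) * variance(population) / n

population = [2, 4, 6, 12]     # hypothetical population of N = 4 values
by_enumeration = sampling_variance_by_enumeration(population, 2)
by_formula = sampling_variance_by_formula(population, 2)
print(by_enumeration, by_formula)
```

The two results agree, which shows that the formula describes the variability of the estimate over all samples that could have been selected, not just the one sample actually drawn.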
Sampling plan: The actual procedure describing how sample units are to be selected and from
which sampling frames.
Sampling unit: The units to be selected. These may or may not be the same as the units of
analysis. For example, to obtain information on persons, one might use a complete listing in a
Census, or a register, and select a sample of persons directly. However, one could also select a
sample of households and include in the survey all persons in the selected households. Similarly,
one could select complete buildings, and include all persons in the sample buildings. The choice
of the most efficient sampling unit is an important consideration in the design of a survey.
Sampling with replacement: A sample obtained by first selecting one element of the
population, replacing it, then making a second selection and replacing it before making the third
selection, etc., until n selections have been made. With this method of selection, a particular unit
can be included more than once in the sample--in fact, up to n times.
Sampling without replacement: A sample obtained by selecting one element of the population
and, without replacing it, selecting one of the remaining elements; then continuing this process
until n different selections have been made. With this method, a unit can be included only once
in any sample.
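The difference between the two methods of selection can be seen in a short sketch using Python's standard random number generator. The list of units, the function names, and the seed are hypothetical.

```python
import random

def sample_with_replacement(units, n, rng):
    """Each selection is made from the full population, so a unit can repeat."""
    return [rng.choice(units) for _ in range(n)]

def sample_without_replacement(units, n, rng):
    """Each selected unit is removed from further selection, so all n units differ."""
    return rng.sample(units, n)

units = list(range(1, 11))        # hypothetical population of 10 numbered units
rng = random.Random(2006)         # fixed seed so the example can be repeated
with_repl = sample_with_replacement(units, 5, rng)
without_repl = sample_without_replacement(units, 5, rng)
print(with_repl, without_repl)
```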
Self-weighting sample: A sample in which every element in the population has the same chance
of selection, although unequal probabilities may have been used at various stages of sampling.
For example, clusters may have been selected with PPS; then the sampling within a selected
cluster is done in such a way as to give each element in it the same chance of being in the sample
as the elements to be selected in other clusters.
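The PPS example in this entry can be worked through numerically. The sketch below assumes a two-stage design in which a fixed number of clusters is selected with PPS and the same number of elements is then selected in every sample cluster; the cluster sizes and the function name are hypothetical.

```python
from fractions import Fraction

def overall_probabilities(cluster_sizes, clusters_selected, take_per_cluster):
    """Overall selection probability of an element in each cluster of a two-stage design."""
    total_size = sum(cluster_sizes)
    probabilities = []
    for size in cluster_sizes:
        first_stage = Fraction(clusters_selected * size, total_size)   # PPS selection of the cluster
        second_stage = Fraction(take_per_cluster, size)                # equal take within the cluster
        probabilities.append(first_stage * second_stage)
    return probabilities

probabilities = overall_probabilities([50, 100, 150], clusters_selected=2, take_per_cluster=10)
print(probabilities)
```

The cluster size cancels when the two stages are multiplied, so every element has the same overall probability even though the clusters themselves were selected with unequal probabilities.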
Simple random sample (also called unrestricted random sample): The simplest type of
sampling system. For a sample of size n, each of the possible combinations of n elementary units
that may be formed from a population of N units has the same chance of selection as every other
combination of n units. Moreover, every element will have the same chance of selection as every
other element (chapters 2, 3, 4, and 5).
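For a very small population, the two properties in this definition can be verified by listing all possible samples. The population labels below are hypothetical.

```python
from itertools import combinations
from math import comb

population = ["A", "B", "C", "D", "E"]    # hypothetical population, N = 5
n = 2

samples = list(combinations(population, n))    # every possible sample of size n
chance_of_each_sample = 1 / len(samples)       # all combinations are equally likely

# Chance that a given element (here "A") is in the sample: equals n / N
chance_of_element = sum("A" in s for s in samples) / len(samples)
print(len(samples), chance_of_each_sample, chance_of_element)
```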
Standard deviation: The standard error of a simple random sample of size 1.
Standard error: A measure of the extent to which estimates from various samples differ from
their expected value. With a reasonably large sample, the distribution of sample results for all
possible samples is approximately the normal distribution, and probability statements can be
made about how close the sample can be expected to come to the expected value--the
probabilities being expressed in terms of the standard error. The standard error usually is
expressed by the Greek letter σ or by S. See also Variance.
Statistic: A quantity computed from sample observations of a characteristic, usually for the
purpose of making an inference about the population. The characteristic may be any variable
associated with a member of the population; such as age, income, employment status, etc. The
quantity may be a total, an average, a median, or other percentile; it may also be a rate of change,
a percentage, a standard deviation or any other quantity whose value we wish to estimate for the
population.
Statistical Inference: A decision, estimate, prediction, or generalization about the population
based on information contained in a sample.
Stratification: The process of dividing a population into groups for the purpose of selecting a
separate sample from each group. Each group is usually made as internally homogeneous as
possible. The groups are called strata with each one referred to as a stratum.
Stratified sampling: The method of sampling from a universe which has been stratified. At least
one sample unit must be selected from each stratum, but at least two units are needed to calculate
variances. Probabilities of selection can be different from stratum to stratum.
Systematic error: As opposed to a random error, an error which is in some sense biased, that is
to say, has a distribution with mean (or some equally acceptable measure of location) not at zero.
Systematic sampling: A method of sample selection in which the population is listed in some
order and every kth element is selected for the sample.
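A minimal sketch of systematic selection follows. It assumes that the list length is an exact multiple of the interval k and that the starting point is a random number between 1 and k; the function name and the list are hypothetical.

```python
import random

def systematic_sample(units, k, start):
    """Select the unit in position `start` (counting from 1) and every kth unit after it."""
    return units[start - 1::k]

units = list(range(1, 21))            # hypothetical list of N = 20 units
k = 5                                 # sampling interval, giving n = N / k = 4
fixed_start_sample = systematic_sample(units, k, start=3)
random_start_sample = systematic_sample(units, k, start=random.randint(1, k))
print(fixed_start_sample, random_start_sample)
```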
Telescoping: The tendency of the respondent to allocate an event to a period other than the
reference period (also called border bias). A telescoping error occurs when the respondent
misremembers the duration of an event. While one might imagine that errors would be
randomly distributed around the true duration, the errors are primarily in the direction of
remembering an event as having occurred more recently than it did. This is due to the
respondent’s wish to perform the task required of him. When in doubt, the respondent prefers to
give too much information rather than too little.
Total Error: The difference between an estimate and its true value in the population measured as
the root mean square error, that is, the square root of the sum of variable error squared and bias
squared.
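The relationship stated in this definition can be written as a one-line calculation. The function name and the figures below are hypothetical.

```python
import math

def root_mean_square_error(variable_error, bias):
    """Total error: square root of (variable error squared plus bias squared)."""
    return math.sqrt(variable_error ** 2 + bias ** 2)

total_error = root_mean_square_error(variable_error=3.0, bias=4.0)
print(total_error)
```

When the bias is zero, the total error reduces to the variable error alone.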
True value: The value that would be obtained if no mistakes were made and no errors existed.
Ultimate cluster: The totality of units included in the sample from a primary unit. Even if the
sample is obtained using different stages of selection, the Primary Sampling Unit becomes the
ultimate cluster.
Unbiased estimate: A type of estimate having the property that the average of such estimates
made from all possible samples of a given size is equal to the true value.
Unbounded recall: Ordinary type of recall, where respondents are asked for expenditures made
since a given date and no control is exercised over the possibility that respondents may
erroneously shift some of their expenditure reports into or out of the recall period.
Unit of Analysis: A unit for which we wish to obtain statistical data. The units may be persons,
households, farms or business firms; they may also be products resulting from some machine
process, etc.
Universe: See Population.
Variance: The square of the standard error; it is usually written as S² with a subscript to indicate
the statistic to which it refers. The term is usually written without a subscript for the square of
the standard deviation. Where there is any possibility of confusion, sampling variance is used
for the square of the standard error, and population variance for the square of the standard
deviation.
