Training Manual on Sample Design for Surveys Draft 2006
Note: This is a draft training document developed by the International Programs Center of the US Bureau of the Census for training in developing countries.

CHAPTER 1 GENERAL NATURE OF SAMPLE SURVEYS
________________________________________________________________________________

1.1 ROLE OF SAMPLING IN STATISTICAL THEORY AND METHODS

In a broad sense, sampling theory can be considered as coextensive with modern statistical methods. Almost all of the modern developments in statistics relate to the inferences that can be made about a population when information is available from only a sample of the elements of the population. Some of the ways in which this is reflected in statistical programs are mentioned below.

1.1.1 Survey Work

In most survey work, the population consists of all persons (or housing units, households, industrial establishments, farms, etc.) in a city or other area. Information is obtained or desired from a sample of the population, but inferences are required on characteristics of the whole population.

1.1.2 Design and Analysis of Experiments

In the design and analysis of experiments, the population represents all possible applications of several alternative techniques which can be used. For example, the experiment may be agricultural, in which a number of fertilizers are being tested. The population is infinite because it represents the use of the fertilizers in all possible farms over all time. The problem is to design experiments so that the maximum amount of information can be made available for inferences about the full population, estimated from a sample of limited size.

1.1.3 Quality Control

In the application of quality control methods in an industrial establishment, for example, the population is all of the products coming out of a machine. Inferences are needed on how well the products conform to specifications.
The term "quality control" is also applied to a sample check on the quality of field work done in a sample survey; the sample check is carried out after the actual survey is completed. Office operations such as editing and coding are also subject to quality control; a sample of the work is checked to determine if it meets acceptable standards.

1.2 CONTENTS OF CHAPTERS

These chapters will be limited to one aspect of sampling; that is, sampling application in survey work. They will deal mainly with principles of sampling from the common sense rather than the mathematical viewpoint, though mathematics cannot be entirely avoided. The emphasis will be on the methods of sampling that can be used under different conditions. The formulas will be presented, some without mathematical proof, but with information on how they should be used. Two types of examples will be used to illustrate the formulas and methods: (a) simple examples to make the techniques clear, and (b) examples taken from actual surveys to show the realistic applications of the methods discussed.

First there will be a general discussion of the subject as a whole, including the nature of probability sampling, and choices of sampling units and sampling frames. Then we shall describe the types of common sample designs: simple random sampling, stratified sampling, and cluster sampling. The features of these designs and the methods of sample selection will be discussed. The different methods of estimating the characteristics of the population from the sample results will also be treated, as well as how to determine the size of sample required for a particular degree of reliability and how to calculate sampling errors.

We shall also discuss the problem of estimating, from a sample, the results that would have been obtained from a full census using the same questionnaire, enumeration or interview procedures, supervision, etc. These are aspects of the problem of sampling error.
There are, of course, nonsampling errors that arise from wrong responses to questions, or from poorly worded questions. These are present in complete censuses as well as in sample surveys. Although the lectures are not primarily concerned with such nonsampling errors, they may be very important. In fact, nonsampling errors often represent more serious limitations on the use of statistics than sampling errors.

1.3 REASONS FOR THE USE OF SAMPLES

There are six basic reasons for the use of samples:

(1) A sample may save money (as compared with the cost of a complete census) when absolute precision is not necessary.

(2) A sample saves time, when data are desired more quickly than would be possible with a complete census.

(3) A sample may make it possible to concentrate attention on individual cases.

(4) In industrial uses, some tests are destructive (for example, testing the length of time an electric bulb will last) and can only be performed on a sample of items.

(5) Some populations can be considered as infinite, and can, therefore, only be sampled. A simple example is an agricultural experiment for testing fertilizers. In one sense, a census can be considered as a sample at one instant of time of an underlying causal system which has random features in it.

(6) Where nonsampling errors are necessarily large, a sample may give better results than a complete census because nonsampling errors are easier to control in smaller-scale operations.

1.4 ILLUSTRATIONS OF SAMPLING

The following illustrate the use of sampling in various situations.

1.4.1 Limited Funds

The use of a sample survey when limited funds are available for collecting information is well known. Sampling may also be used to save money in tabulation. For example, in the 1950 Census in the United States most of the data were collected on a 100-percent basis.
However, many tabulations were made on a sample basis (20% or 3-1/3%) for special detailed classifications to save the cost of tabulating 150,000,000 individual records. The 1960 Census utilized sampling procedures to an even greater extent in both the collection and the tabulation of data.

1.4.2 Time Saving

Other examples from the 1950 census in the United States illustrate how samples can be used to save time. The enumeration of the census was taken in April 1950. The time required for processing the results was such that publication of the results was expected to start in 1951 and continue through 1952. A sample of the census results was selected for quick processing and tabulation, and preliminary results were published on the basis of this sample. These results were issued 1 to 2 years earlier than the complete census results.

1.4.3 Concentration on Particular Cases

Some surveys require such intensive and time-consuming interviews that it is impossible to consider them on any basis except a sample basis. Moreover, the use of sampling permits particular attention to be given to a limited number of cases. Examples are family budget studies and comprehensive studies of health conditions.

1.4.4 Sampling for Time Series

Information may be required for a time series when data are available only for particular periods of time and results are needed promptly. The series may be one of economic activity in the country, with figures available only on a yearly or monthly basis, or it may be one of producing a learning curve for which only occasional tests are possible.

1.4.5 Controlling Nonsampling Errors

An interesting example arose in the 1950 United States Census of a case where the relationship between nonsampling and sampling errors made sample results preferable to complete census results. The United States has conducted a monthly sample survey of the labor force since 1940. In 1950, it was based on a sample of 20,000 households.
The information obtained in the 1950 complete census also included labor force status. When the results of the census became available, it was clear that the figures for both unemployed and employed persons were quite different from those estimated from the labor force sample survey; the differences were far beyond what could be expected on the basis of the sampling errors. The problem of reporting in the census introduced much greater error than the sampling error of the monthly survey (this greater error was caused by the use of enumerators who, for the most part, were inexperienced in interviewing). Users of census data were advised, therefore, to use the sample results as the more reliable national statistics on the labor force.

1.5 LIMITATIONS OF SAMPLING

Under certain conditions, the usefulness of sampling becomes questionable. Three principal conditions can be mentioned.

(1) If data are needed for very small areas, disproportionately large samples are required, since precision of a sample depends largely on the sample size and not on the sampling rate. In this case, sampling may be almost as expensive as a complete census.

(2) If data are needed at regular intervals of time, and it is important to measure very small changes from one period to the next, very large samples may be necessary.

(3) If there are unusually high overhead costs connected with a sample survey, caused by work involved in sample selection, control, etc., sampling may be impractical. For example, in a country with many small villages it may be more economical to enumerate all the households in the sample villages than to enumerate a sample of households within the sample villages. For office processing, however, a sample of the enumerated households may be used to reduce the work and costs of producing tabulations.
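Limitation (1) above, that precision depends largely on the sample size and not on the sampling rate, can be illustrated numerically. The following is only a sketch and not part of the manual's own material: it assumes the sampling-error formula for simple random sampling without replacement that is presented in Section 3.4.2, and it uses a hypothetical population standard deviation S = 1 and hypothetical population sizes.

```python
import math


def sampling_error(S, n, N):
    """Sampling error of a sample mean under simple random sampling
    without replacement: sqrt((1 - n/N) * S^2 / n)."""
    f = n / N  # sampling fraction
    return math.sqrt((1 - f) * S ** 2 / n)


S = 1.0  # hypothetical population standard deviation (an assumption)

# Same sample size, very different sampling rates: precision barely changes.
for N in (10_000, 1_000_000, 100_000_000):
    print(f"n=1000, N={N:>11,}: rate={1000 / N:.5%}, "
          f"sampling error={sampling_error(S, 1000, N):.5f}")

# Same sampling rate (1%), very different sample sizes: precision changes a lot.
for N in (10_000, 1_000_000):
    n = N // 100
    print(f"rate=1%, N={N:>9,}, n={n:>6,}: "
          f"sampling error={sampling_error(S, n, N):.5f}")
```

Under these assumptions, a small area with N = 10,000 needs about the same number of sample units as a whole country to reach a given precision, which is why sampling for very small areas can approach the cost of a complete census.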

CHAPTER 2 CRITERIA AND DEFINITIONS
________________________________________________________________________________

2.1 CRITERIA FOR THE ACCEPTABILITY OF A SAMPLING METHOD

It has been demonstrated repeatedly in practical applications that modern sampling methods can provide data of known reliability on an efficient and economical basis. However, although a sample includes only part of a population, it would be misleading to call a collection of numbers a "sample" merely because it includes part of a population. To be acceptable for statistical analysis, a sample must represent the population and must have measurable reliability. In addition, the sampling plan should be practical and efficient.

2.1.1 Chance of Selection for Each Unit

The sample must be selected so that it properly represents the population that is to be covered. This means that each unit (farm, household, person, or whatever unit is being sampled) must have a nonzero probability (chance) of being selected.

2.1.2 Measurable Reliability

It should be possible to measure the reliability of the estimates made from the sample. That is, in addition to the desired estimates of characteristics of the population (totals, averages, percentages, etc.) the sample should give measures of the precision of these estimates. As we shall see later, these measures of precision can be used to indicate the maximum error that may reasonably be expected in the estimates, if the procedures are carried out as specified, and if the sample is moderately large. The estimation of precision is not possible unless the selection is carried out so that the chance of selection of each unit is known in advance and random sampling is used.

2.1.3 Feasibility

A third characteristic is that the sampling plan must be practical. It must be sufficiently simple and straightforward so that it can be carried out substantially as planned; that is, the sampling theory and practice will be the same.
A plan for selecting a sample, no matter how attractive it may appear on paper, is useful only to the extent that it can be carried out in practice. When the methods actually followed are the same (or substantially the same) as specified in the sampling plan, then known sampling theory provides the necessary measures of reliability. In addition, the measures of reliability computed from the survey results will serve as powerful guides for future improvement in important aspects of the sample design.

2.1.4 Economy and Efficiency

Finally, the design should be efficient. Among the various sampling methods that meet the three criteria stated above, we would naturally choose the method which, to the best of our knowledge, produces the most information at the smallest cost. Although this is not an essential feature of an acceptable sampling plan, it is clearly a highly desirable one. It implies that the most effective possible use will be made of all available facilities and resources, such as maps, other statistical data, personal knowledge, sampling theory, etc.

We shall consider only sampling methods that conform to the above criteria. We shall present basic theory for various alternative designs which are possible, and methods of measuring their precision. We shall also stress practical methods of application and considerations of efficiency.

2.2 DEFINITIONS OF TERMS

2.2.1 Statistical Survey

The statistical survey is an investigation involving the collection of data. Observations or measurements are taken on a sample of elements for making statistical inferences (see Glossary in Annex A) about a defined group of elements. Surveys are conducted in many ways.

2.2.2 Unit of Analysis

The unit of analysis is the unit for which we wish to obtain statistical data. The most common units of analysis are persons, households, farms, and business firms. They may also be products coming out of some machine process.
The unit of analysis is frequently called an element of the population. There may be more than one unit of analysis in the same survey; for example, households and persons; or number of farms and hectares (or acres) harvested.

2.2.3 Characteristic

A characteristic is a general term for any variable or attribute having different possible values for different individual units of sampling or analysis. In a sample survey, we observe or measure the values of one or more characteristics for the units in the sample. For example, we observe (or ask about) the area of land for rice crop, the number of cattle on a farm, the age and sex of a person, the number of children per family, etc. So, we observe a unit, but we measure several characteristics of that unit.

2.2.4 Population or Universe

The population or universe is the entire group of all the units of analysis whose characteristics are to be estimated. The chapters in this sampling manual will deal primarily with a finite population, having N units.

2.2.5 Probability Sample

A probability sample is a sample obtained by application of the theory of probability. In probability sampling, every element in a defined population has a known, nonzero, probability of being selected. It should be possible to consider any element of the population and state its probability of selection.

2.2.6 Sampling with Replacement and Without Replacement

A simple way of obtaining a probability sample is to draw the units one by one with a known probability of selection assigned to each unit of the population at the first and each subsequent draw. The successive draws may be made with or without replacing the units selected in the preceding draws. The former is called the procedure of sampling with replacement, and the latter, sampling without replacement.

2.2.7 Simple Random Sampling

Simple random sampling is a special case of probability sampling, sometimes called unrestricted random sampling.
It is a process for selecting n sampling units one at a time, from a population of N sampling units so that each sampling unit has an equal chance of being in the sample. Every possible combination of n sampling units has the same chance of being chosen. Selection of one sampling unit at a time with equal probability may be accomplished by either sampling with replacement or without replacement. In practice, almost all samples, if not all, are selected without replacement. Using a table of random numbers to select the units satisfies this definition of simple random sampling.

2.2.8 Sampling Frame

The totality of the sampling units from which the sample is to be selected is called the sampling frame. The frame may be a list of persons or of housing units; it may be a subdivided map, or it may be a directory of names and addresses stored in some kind of electronic medium, such as a file on a hard disk or a database.

2.2.9 Parameter

A parameter is a quantity computed from all values in a population set. That is, a parameter is a descriptive measure of a population. For example, consider a population consisting of N elements. Then the population total, the population average or any other quantity computed from measurements including all elements of the population is a parameter. The objective of sampling is to estimate the parameters of a population.

2.2.10 Statistic

A statistic is a quantity computed from sample observations of a characteristic, usually for the purpose of making an inference about the characteristic in the population. The characteristic may be any variable which is associated with a member of the population, such as age, income, employment status, etc.; the quantity may be a total, an average, a median, or other quantiles. It may also be a rate of change, a percentage, a standard deviation, or it may be any other quantity whose value we wish to estimate for the population.
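The definitions in Sections 2.2.6 to 2.2.10 can be sketched in a few lines of code. This is a minimal illustration and not a procedure prescribed by the manual: the five-unit frame, its values, and the seed are assumptions made for the example, and Python's pseudo-random generator stands in for a table of random numbers.

```python
import random

# Hypothetical sampling frame: unit label -> household size (assumed values).
frame = {1: 3, 2: 5, 3: 7, 4: 9, 5: 11}
N = len(frame)

# Parameter: a quantity computed from ALL N units of the population.
population_mean = sum(frame.values()) / N

rng = random.Random(2006)  # fixed seed so the draws can be repeated
n = 2
labels = sorted(frame)

# Sampling without replacement: each unit can be selected at most once,
# and every combination of n units has the same chance of being chosen.
units_without = rng.sample(labels, n)

# Sampling with replacement: a unit is returned to the frame after each
# draw, so the same unit may be selected more than once.
units_with = rng.choices(labels, k=n)

# Statistic: a quantity computed from the sample observations only.
sample_mean = sum(frame[u] for u in units_without) / n

print("parameter (population mean):", population_mean)
print("units drawn without replacement:", units_without)
print("units drawn with replacement:", units_with)
print("statistic (sample mean):", sample_mean)
```

The parameter stays fixed whichever units are drawn, while the statistic changes from one sample to the next.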
Note that the term statistic refers to a sample estimate and the term parameter refers to a population value.

Note on Quantiles: What is a quantile? If a set of data is arranged in order of magnitude, the middle value (or the arithmetic mean of the two middle values) which divides the set into two equal parts is the MEDIAN. By extending this idea we can think of those values which divide the set into four equal parts. These values, denoted by Q1, Q2 and Q3, are called the first, second and third quartiles respectively, the value of Q2 being equal to the median. Similarly the values which divide the data into ten equal parts are called deciles and are denoted by D1, D2, ..., D9, while the values dividing the data into one hundred equal parts are called percentiles and are denoted by P1, P2, ..., P99. The 5th decile and the 50th percentile correspond to the median. The 25th and 75th percentiles correspond to the first and third quartiles, respectively. Collectively, quartiles, deciles, percentiles and other values obtained by equal subdivisions of the data are called quantiles.

2.2.11 Independent Information

Independent information consists of data that are known in advance of or simultaneously with the survey which are not based on the survey but are used to improve the survey design. Such data may be used for purposes of stratification, for determining the probabilities of selection, or in estimating the final results from the sample data. The data must be of good, known quality.

2.2.12 Estimate and Estimator

An estimate is a numerical quantity computed from sample observations of a characteristic and intended to provide information about an unknown population value. An estimator is a mathematical formula or rule which uses sample results to produce an estimate for the entire population. For example, the sample average, $\bar{y}$, is an estimator.
It provides an estimate of the parameter, the population average, $\bar{Y}$. That is, the sample average is an estimate of the population average. Therefore, the estimator refers to a mathematical formula. When numbers are plugged into the formula, an estimate is produced. However, in common statistical language, the words estimate and estimator are used interchangeably.

2.2.13 Probability of Selection

The probability of selection is the chance that each unit in the population has of being included in the sample. Probability values range from 0 to 1, inclusive.

2.2.14 Random Variables

A random variable is a variable which, by chance, can be equal to any value in a specified set. The probability that it equals any given value (or falls between two limits) is either known, can be determined, or can be approximated or estimated. A chance mechanism determines the value which a random variable takes. For example, in flipping a coin, we can define the random variable X which takes the value 1 if the coin lands 'heads' and the value 0 if the coin lands 'tails'. Therefore, the variable X, as just defined, can take either one of two values after the coin is flipped.

2.2.15 Probability Distribution

The probability distribution gives the probabilities associated with the values which a random variable can equal. If there are N values that a random variable X can take, say X1, X2, ..., XN, then there are N probabilities associated with the Xi values, namely P1, P2, ..., PN. The probabilities and the values the random variable takes constitute the probability distribution of X.

2.2.16 Illustration

The 2000 U.S. Census of Population and Housing found that 281,421,906 persons lived in 105,480,101 households, of which 71,787,347 are Family Households and 33,692,754 are Nonfamily Households. Table 2.1 below shows the distribution of households by type [1]. These data show that 68.1% of all households are of the "family" type and 31.9% are of the "nonfamily" type.
Now if we were to pick a household at random, what is the probability that we would pick a family household? If each household, large or small, is equally likely to be picked, then there is a 0.681 probability of picking a family household.

Table 2.1 [2]. Type of U.S. Households, 2000

Type of Household                           Number of Households   Fraction of Total Households (%)
Family Households                                     71,787,347                               68.1
  Married Couple                                      54,493,232                               51.7
  Female Householder, no husband present              12,900,103                               12.2
  Male Householder, no wife present                    4,394,012                                4.2
Nonfamily Households                                  33,692,754                               31.9
  One Person                                          27,230,075                               25.8
  Two or More People                                   6,462,679                                6.1
Total Households                                     105,480,101                              100.0

[1] The Census Bureau defines a household as persons who occupy a house, apartment, or other separate living quarters. One of the tests in determining a household is that there are complete kitchen facilities for the exclusive use of the occupants. People who are not in households live in group quarters including rest homes, rooming houses, military barracks, jails, and college dormitories.

[2] Source: U.S. Census Bureau, 2000 Census of Population and Housing.

Exercises

2.1 In order to select a sample of the total population of a city, a sample is selected from the telephone directory for that city and the families of the persons selected are interviewed. Does this satisfy the criteria for acceptability? Explain.

2.2 In order to determine the population of a city where all children of school age attend school, a sample of school children is drawn and their families are interviewed. Give two reasons why this does not meet the criteria for acceptability. (Think of families who have more than one child in school and families that don't have any children.)

2.3 Suppose that you were using sampling to estimate the total number of words in a book that contains illustrations. (a) Is there any problem of definition of the population? (b) What are the pros and cons of using (1) the page, (2) the line as a sampling unit?
2.4 Suppose that you work for a major public opinion pollster and you wish to estimate the proportion of adult citizens who think the President is doing a good job in heading the nation's economy. Clearly define the population you wish to sample.

2.5 The problem of finding a frame that is complete and from which a sample can be drawn is often an obstacle. What kinds of frames might be tried for the following surveys? Do the frames have any serious weakness?
(a) A survey of stores that sell luggage in a large city.
(b) A survey of the kinds of articles left behind in subways or on buses.
(c) A survey of persons bitten by snakes during the last year.
(d) A survey to estimate the number of hours per week spent by family members watching television.

CHAPTER 3 SIMPLE RANDOM SAMPLING SAMPLING DISTRIBUTION
________________________________________________________________________________

3.1 INTRODUCTION

In this chapter, we shall introduce the concept of the sampling distribution of a statistic, probably the most basic concept of statistical inference. We shall concentrate only on the sample mean and its sampling distribution. We shall first introduce certain definitions and relationships of terms needed for the sampling distribution.

3.2 EXPECTED VALUE

The expected value is the average value for a single characteristic over all possible samples. Mathematically, we define the expected value (or mean) of a random variable Y as follows:

$$E(Y) = \sum y \, P(y)$$

where $\sum P(y) = 1$ and the Greek letter $\Sigma$ is used to indicate the sum of the products of all possible values of y and their associated probabilities P(y). The small y denotes a particular value of Y.

The expected value is a weighted average of the possible outcomes, with the probability weights reflecting the likelihood of occurrence of each outcome. Thus, the expected value should be interpreted as the long-run average value of Y, if the frequency with which each outcome occurs is in accordance with its probability.
For example, consider the tossing of a die in which each outcome (numbers 1 to 6) has the same probability of occurring, 1/6 (assuming the die is not biased). If Y is used to represent the number that appears when we throw a die, the expected value of Y is given by:

$$E(Y) = 1(1/6) + 2(1/6) + 3(1/6) + 4(1/6) + 5(1/6) + 6(1/6) = 3.5$$

The expected value of Y is not the most likely or the most typical value of Y. It is the long-run average value of Y, if we repeatedly perform the experiment that originates the outcomes. Some throws of the die will produce numbers below 3.5 and others above 3.5. The average of these different numbers, in the long run, will be 3.5.

3.2.1 Unbiased Estimate

An unbiased estimate has the property that the average of all the estimates obtained from all possible samples of a given size is equal to the true value. Mathematically, an estimate is unbiased if the expected value of the estimate is equal to the parameter being estimated. For example, if $\hat{\theta}$ is an estimate of the parameter $\theta$ and if

$$E(\hat{\theta}) = \theta$$

then $\hat{\theta}$ is an unbiased estimate of $\theta$. Otherwise,

$$\text{Bias}(\hat{\theta}) = E(\hat{\theta}) - \theta$$

That is, the bias is the difference between the expected value of an estimate and the true population value (parameter) being estimated.

3.2.2 Consistent Estimate

An estimate is consistent if its values tend to concentrate increasingly around the true value as the sample size increases. In other words, the estimate assumes the population value with probability approaching unity as the sample size tends to infinity. This definition of consistency strictly applies to estimates based on samples drawn from an infinite population. We use the following definition in the case of a finite population. An estimate is said to be a consistent estimate of the parameter $Y$ if it takes the population value when n = N.

In the next section we will see that for simple random sampling the sample mean is an unbiased estimate of the population mean, and that it is consistent as the sample size increases.
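The die example of Section 3.2 can be checked with a short computation. The sketch below is illustrative only: it evaluates the expected value exactly from the definition, and then simulates a long run of throws with a pseudo-random generator (the seed and the number of throws are arbitrary assumptions) to show the long-run average settling near the expected value.

```python
from fractions import Fraction
import random

# Exact expected value: sum of each outcome times its probability.
outcomes = range(1, 7)
probability = Fraction(1, 6)  # unbiased die: every face equally likely
expected_value = sum(y * probability for y in outcomes)
print("E(Y) =", expected_value, "=", float(expected_value))

# Long-run average: simulate many throws and average the results.
rng = random.Random(1)  # arbitrary seed, for repeatability
throws = [rng.randint(1, 6) for _ in range(100_000)]
long_run_average = sum(throws) / len(throws)
print("average of", len(throws), "simulated throws:", long_run_average)
```

No single throw can produce 3.5, which is the point made in the text: the expected value is a long-run average, not the most likely outcome.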

3.3 SAMPLING DISTRIBUTION

A sampling distribution is the probability distribution of all possible values that an estimate might take under a specified sampling plan. In this section we will show by examples that the sample average (mean) is both an unbiased and a consistent estimate of the true population average.

Let us first present the idea of a sampling distribution of the mean by actually listing all possible random samples of size n = 2 which can be drawn from a hypothetical population of N = 5 housing units (HUs) shown in Table 3.1. We wish to estimate the average household (HH) size of these HUs from a sample.

Table 3.1 Household Size per Household

HU        1    2    3    4    5
HH Size   3    5    7    9   11

The total number of persons in the population is:

$$Y = \sum_{i=1}^{N} Y_i = 3 + 5 + 7 + 9 + 11 = 35$$

The average number of persons per household (or average household size) is:

$$\bar{Y} = \frac{Y}{N} = \frac{35}{5} = 7$$

If we take a sample of size 2 from this population, there are $\binom{5}{2} = 10$ possibilities, and they are:

3 and 5, 3 and 7, 3 and 9, 3 and 11, 5 and 7, 5 and 9, 5 and 11, 7 and 9, 7 and 11, 9 and 11

The means of these samples are 4, 5, 6, 7, 6, 7, 8, 8, 9, and 10, respectively, and if sampling is random so that each sample has the probability 1/10, we obtain all the possible samples of size two HUs from a population of 5 HUs, as shown in Table 3.2. Table 3.3 presents the sampling distribution of the mean.

Table 3.2 Samples of Two HUs from a Population of 5 HUs

Samples of size n = 2   Value of $\bar{y}$   Probability $p(\bar{y})$
3, 5                     4                   1/10
3, 7                     5                   1/10
3, 9                     6                   1/10
3, 11                    7                   1/10
5, 7                     6                   1/10
5, 9                     7                   1/10
5, 11                    8                   1/10
7, 9                     8                   1/10
7, 11                    9                   1/10
9, 11                   10                   1/10

Table 3.3 Sampling Distribution of the Mean

Mean   Probability
 4     1/10
 5     1/10
 6     2/10
 7     2/10
 8     2/10
 9     1/10
10     1/10

An examination of this sampling distribution reveals some pertinent information relative to the problem of estimating the mean of the given population using a random sample of size 2.
For instance, we see that corresponding to 6, 7, or 8, the probability is 6/10 that a sample mean will not differ from the population mean (which is 7) by more than 1, and that corresponding to 5, 6, 7, 8, or 9, the probability is 8/10 that a sample mean will not differ from the population mean by more than 2. Further useful information about this sampling distribution of the mean can be obtained by calculating its expected value as follows:

$$E(\bar{y}) = 4(1/10) + 5(1/10) + 6(2/10) + 7(2/10) + 8(2/10) + 9(1/10) + 10(1/10) = 7$$

We may also use Table 3.2 to compute the expected value of $\bar{y}$:

$$E(\bar{y}) = \frac{4 + 5 + 6 + 7 + 6 + 7 + 8 + 8 + 9 + 10}{10} = \frac{70}{10} = 7$$

Note that the same results would be obtained for samples of any size. Recall the definition of the expected value, which is the average of a single characteristic over all possible samples. With simple random sampling the sample mean is an unbiased estimate of the true mean.

We will now compare the distribution of the sample estimates to show that:

(1) As the sample size increases, the means of the samples tend to concentrate more and more around the true average value. In other words, the estimates tend to become more and more reliable as the sample size increases.

(2) The percentage distributions of the sample estimates can be used to predict the chance of obtaining a sample estimate within specified ranges of the true value.

To see the above statements, consider a hypothetical population of 12 individuals. We wish to make different estimates from a sample of 1, 2, 3, 4, 5, 6 and 7 individuals. The full population is shown in Table 3.4 below.

Table 3.4 Income of Hypothetical Population of 12 Persons

Individual   Income      Individual   Income
 1           $1,300       7           $1,800
 2           $6,300       8           $2,700
 3           $3,100       9           $1,500
 4           $2,000      10           $  900
 5           $3,600      11           $4,800
 6           $2,200      12           $1,900

TOTAL INCOME: $32,100
AVERAGE INCOME: $2,675

A frequency distribution of the sample means is illustrated in Table 3.5 for samples of sizes 1, 2, 3, 4, 5, 6 and 7 individuals. For each sample size, the percentage of the sample estimates falling within a specified range of the true value and the average of the means are also shown in the table.
For example, the proportion of the sample results falling between $2,000 and $3,400 is 47% for samples of 2; 58% for samples of 3; 69% for samples of 4; and 78%, 87%, and 94% for samples of 5, 6, and 7 respectively. This tells us that by taking samples large enough, the proportion of the sample estimates falling within a designated interval about the expected value can be made as close to 100% as desired. That is, we can predict the precision of a sample if we have the distribution of all sample estimates of a given size for the population. The increasing concentration of sample estimates around the true value illustrates consistency, a quality possessed by important types of sample estimates.

Table 3.5 All Possible Estimates of Average Income from Samples Drawn Without Replacement from the Population of 12 Persons

Average income           Number of samples having indicated estimate of
estimated from sample    average income with sample of size n
                          n=1     n=2     n=3     n=4     n=5     n=6     n=7
$  800 to $1,199            1       1       -       -       -       -       -
$1,200 to $1,399            1       2       3       1       -       -       -
$1,400 to $1,599            1       5      10      11       7       1       -
$1,600 to $1,799            -       6      15      25      25      16       6
$1,800 to $1,999            2       5      20      42      55      50      27
$2,000 to $2,199            1       6      22      50      78      84      61
$2,200 to $2,399            1       6      22      52      90     109      98
$2,400 to $2,599            -       6      19      52     101     139     136
$2,600 to $2,799            1       3      17      49     108     151     150
$2,800 to $2,999            -       4      16      57     101     133     130
$3,000 to $3,199            1       3      16      46      81     107     108
$3,200 to $3,399            -       3      16      38      61      79      62
$3,400 to $3,599            -       2      13      26      46      43      14
$3,600 to $3,799            1       2      10      21      27      12       -
$3,800 to $3,999            -       3       7      11      10       -       -
$4,000 to $4,199            -       3       4      10       2       -       -
$4,200 to $4,399            -       2       6       3       -       -       -
$4,400 to $4,599            -       1       1       1       -       -       -
$4,600 to $4,799            -       1       2       -       -       -       -
$4,800 to $6,399            2       2       1       -       -       -       -
Number of samples          12      66     220     495     792     924     792
Average of all
possible samples*      $2,675  $2,675  $2,675  $2,675  $2,675  $2,675  $2,675

* Expected value

This means that if the sample is sufficiently large, one takes very little risk in using sample estimates.
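The enumerations behind Tables 3.2, 3.3 and 3.5 can be reproduced by listing every possible sample. The following sketch is one way to do so and is not taken from the manual; the function names `sampling_distribution`, `expected_value` and `share_within` are hypothetical helpers introduced for illustration, and the interval "between $2,000 and $3,400" is assumed to mean $2,000 up to but not including $3,400, matching the class limits of Table 3.5.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations
from math import comb


def sampling_distribution(values, n):
    """Map each possible sample mean to its probability, over all samples
    of size n drawn without replacement with equal probability."""
    means = [Fraction(sum(s), n) for s in combinations(values, n)]
    counts = Counter(means)
    return {m: Fraction(c, len(means)) for m, c in sorted(counts.items())}


def expected_value(dist):
    """Sum of each possible mean times its probability."""
    return sum(m * p for m, p in dist.items())


def share_within(dist, low, high):
    """Probability that the sample mean falls in [low, high)."""
    return sum(p for m, p in dist.items() if low <= m < high)


# Table 3.1: household sizes of the 5 housing units.
hh_sizes = [3, 5, 7, 9, 11]
dist_hh = sampling_distribution(hh_sizes, 2)
for mean, prob in dist_hh.items():
    print(f"mean {mean}: probability {prob}")
print("expected value of the sample mean:", expected_value(dist_hh))

# Table 3.4: incomes of the 12 persons.
incomes = [1300, 6300, 3100, 2000, 3600, 2200,
           1800, 2700, 1500, 900, 4800, 1900]
for n in range(1, 8):
    dist = sampling_distribution(incomes, n)
    share = share_within(dist, 2000, 3400)
    print(f"n={n}: {comb(len(incomes), n)} samples, "
          f"average of means={float(expected_value(dist)):.0f}, "
          f"share in $2,000-$3,399={float(share):.1%}")
```

Exact fractions are used so that the average of the sample means can be compared with the population mean without rounding error. Probabilities are printed in lowest terms, so 2/10 appears as 1/5.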
(From the above illustration, it might appear that the increase in concentration arises from the fact that, as the size of the sample increases, the percentage of the population in the sample becomes higher. Actually, similar results would be observed when the size of sample increases even though only a small proportion of the universe is included.)

3.4 PREDICTING RELIABILITY OF SAMPLE ESTIMATES (CONFIDENCE INTERVAL)

We have seen that the precision of a sample can be predicted if we have the distribution of all sample estimates of a given size for the population. In a real situation, we cannot select all possible samples and examine the estimates derived from them. We must depend upon a single sample. Therefore, it is necessary to find some measure of the extent to which the estimates made from various samples differ from the true value; this measure, if it is to be useful, must be one that can be estimated from the sample itself. Before showing how and why we can do this, we shall introduce certain definitions and relationships which are derived from the theory of sampling.

3.4.1 Standard Deviation

We shall show that there is a measure of the variability in the original population which can be estimated from the observations in a single sample, and from which it is possible to estimate the expected error in the sample mean. The measure of variability in the population is called the standard deviation; its square is called the population variance and is designated by the symbol $\sigma^2$ or VAR. The variance of the population is defined as the average of the squares of the deviations of all the individual observations from their mean value. Thus, it would be computed by the following process, if all the values in the universe could be observed:

$$\sigma^2 = \frac{(Y_1 - \bar{Y})^2 + (Y_2 - \bar{Y})^2 + \cdots + (Y_N - \bar{Y})^2}{N} = \frac{\sum_{i=1}^{N} (Y_i - \bar{Y})^2}{N}$$

where the Y's with subscripts are individual observations and $\bar{Y}$ is the mean of the N observations for the N elements in the universe.
Note that it has become fairly general practice to denote the population variance by $\sigma^2$ when dividing by N, and by $S^2$ when dividing by N-1; symbolically,

$$S^2 = \frac{\sum_{i=1}^{N} (Y_i - \bar{Y})^2}{N - 1}$$

Its sample equivalent is given by:

$$s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}$$

where n is the sample size, $y_i$ is the sample measurement of a characteristic and $\bar{y}$ is the sample mean. We will use $S^2$ throughout the text because $s^2$ is an unbiased estimate of $S^2$. Note that all results are equivalent in either notation. Also,

$$\sigma^2 = \frac{N - 1}{N}\,S^2$$

3.4.2 Sampling Error of Sample Means

The variance of the sample means is the average of the squares of the deviations of the means of all possible samples of size n from the true mean. The variance of $\bar{y}$ is denoted by $S_{\bar{y}}^2$ and we write:

$$S_{\bar{y}}^2 = (1 - f)\,\frac{S^2}{n} = \frac{N - n}{N}\cdot\frac{S^2}{n}$$

where f = (n/N) = sampling fraction. The square root of the variance of $\bar{y}$ is called the sampling error for means of samples of size n. The sampling error of $\bar{y}$ is:

$$S_{\bar{y}} = \sqrt{\frac{N - n}{N}}\cdot\frac{S}{\sqrt{n}}$$

It is important to note that the sampling error varies with the size of the sample, as we would expect. If we compute the sampling error for all possible samples of the sizes shown in Table 3.5, we see that as the sample size increases, the sampling error becomes smaller and smaller. This is shown in the following illustration (see Table 3.6).

The factor (N - n)/N in the formula for the variance of $\bar{y}$ is called the finite population correction factor (fpc). As a rule of thumb, if $n \le 0.05N$ we can ignore (N - n)/N since its value will be close to 1. Otherwise we should include it in the formula in order not to severely overestimate the variance of $\bar{y}$.

3.4.3 Illustration

Consider again the population of 12 individuals in Table 3.4. In this case, the true average is $\bar{Y}$ = $2,675 with N = 12. We compute $S^2$ as follows:

$$S^2 = \frac{\sum (Y_i - \bar{Y})^2}{N - 1} = \frac{27{,}162{,}500}{11} = 2{,}469{,}318.18$$

and S = $1,571.41. An easier way to calculate $S^2$ is as follows:

$$S^2 = \frac{\sum Y_i^2 - N\bar{Y}^2}{N - 1} = \frac{113{,}030{,}000 - 12\,(2{,}675)^2}{11} = 2{,}469{,}318.18$$

Using S, we can compute the sampling error of the sample mean for different sample sizes n. For example, if the sample size n = 1 then,

$$S_{\bar{y}} = \sqrt{\frac{12 - 1}{12}}\cdot\frac{1{,}571.41}{\sqrt{1}} \approx 1{,}505$$

for n = 2,

$$S_{\bar{y}} = \sqrt{\frac{12 - 2}{12}}\cdot\frac{1{,}571.41}{\sqrt{2}}$$

The sampling errors for all possible sample sizes are given in the following table.
Table 3.6 Sampling Error of Estimates of Average Income for Various Sample Sizes

Size of Sample    Sampling Error of Estimated Measure
      1                      $1,505
      2                       1,015
      3                         786
      4                         642
      5                         537
      6                         454
      7                         383

3.4.4 Interval Estimate (Confidence Interval)

We know that the probability of an estimate being equal to the true value (parameter) is zero for continuous variables. Thus, it will be more useful if we can state how probable it is that an interval based on our estimate will contain the parameter to be estimated.

Interval estimator: An interval estimator is a formula that tells us how to use the sample observations to calculate two numbers that define an interval which will enclose the parameter being estimated with a certain (usually high) probability. The resulting interval is called a confidence interval and the probability that it contains the true parameter is called its confidence coefficient. If a confidence interval has a confidence coefficient equal to .95, we call it a 95% confidence interval.

In general, the confidence interval for a parameter $\theta$ is given by

$$\hat{\theta} \pm t\,S_{\hat{\theta}}$$

The symbol t is the value of the normal deviate corresponding to the desired confidence probability.

In practice, $S^2$ is not known. Usually, $s^2$, the sample variance, is calculated from the sample data and used as an estimate of $S^2$. If n is large, s provides a fairly good estimate of S; however, for small samples this may not be the case. Using s, the confidence interval is

$$\hat{\theta} \pm t\,s_{\hat{\theta}}$$

For the parameter $\bar{Y}$ the confidence interval is:

$$\bar{y} \pm t\,\sqrt{\frac{N - n}{N}}\cdot\frac{s}{\sqrt{n}}$$

(Ignore the fpc if $n \le 0.05N$.)

The value t depends on the level of confidence desired. For large samples, the most common values (see Appendix I - Normal Distribution Table) are:

t = 1.28 for 80% confidence level
t = 1.64 for 90% confidence level
t = 1.96 for 95% confidence level
t = 2.58 for 99% confidence level.

If the sample size is less than 30, the percentage points may be taken from the Student's t table (see Appendix II) with (n-1) degrees of freedom.
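The formulas of Sections 3.4.2 to 3.4.4 can be sketched in a few lines of Python. The first part recomputes sampling errors for the population of Table 3.4, which should come out close to the entries of Table 3.6; the second part builds a confidence interval from a single sample. The function names and the sample figures in the last lines (mean 50, s = 10, n = 100, N = 1,000) are hypothetical and chosen only for illustration, and the dictionary of t values covers only the large-sample normal deviates quoted above.

```python
import math
from statistics import stdev

# Incomes of the 12 persons in Table 3.4.
INCOMES = [1300, 6300, 3100, 2000, 3600, 2200,
           1800, 2700, 1500, 900, 4800, 1900]

# Normal deviates quoted in the text for large samples (confidence level in %).
T_VALUES = {80: 1.28, 90: 1.64, 95: 1.96, 99: 2.58}


def sampling_error_of_mean(S, n, N=None):
    """sqrt((N - n) / N) * S / sqrt(n); the fpc is skipped when N is None."""
    fpc = 1.0 if N is None else (N - n) / N
    return math.sqrt(fpc) * S / math.sqrt(n)


def confidence_interval(estimate, sampling_error, level=95):
    """Return (lower, upper) = estimate -/+ t * sampling_error."""
    t = T_VALUES[level]
    return estimate - t * sampling_error, estimate + t * sampling_error


# statistics.stdev divides by N - 1, so this is S rather than sigma.
S = stdev(INCOMES)
for n in range(1, 8):
    print(n, round(sampling_error_of_mean(S, n, N=len(INCOMES)), 1))

# Hypothetical sample, not taken from the manual.
low, high = confidence_interval(50.0, sampling_error_of_mean(10.0, 100, N=1000))
print(round(low, 2), round(high, 2))
```

For samples smaller than 30 the t value would have to come from a Student's t table instead of `T_VALUES`; that lookup is left out of this sketch.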
3.4.5 Approach to Normal Distribution

Comparing Tables 3.5 and 3.6, it can be seen that as the sample size increases, the sample estimates differ less and less from the expected value, and at the same time the sampling error becomes smaller and smaller. In practical sampling problems, where a reasonably large sample is used (generally 30 or more cases), the distribution of sample results over all possible samples approximates very closely the normal distribution, the familiar bell-shaped curve. This is the result of the most important theorem in statistics, the Central Limit Theorem, which states, briefly, that sums (and therefore means) of a large number of independent random variables are approximately normally distributed.

For this distribution, the probabilities of being within a fixed range of the average value are well known and have been published (see Appendix I). These probabilities depend solely on the value of the sampling error. For example, the probability of being within one sampling error is 68 percent; for two sampling errors, it is 95 percent; for three sampling errors, it is 99.7 percent.

The implications are of fundamental importance to sampling theory. Suppose we have drawn a simple random sample from a population, have computed the mean $\bar{y}$ from the sample and have estimated the true sampling error of the mean by means of $s_{\bar{y}}$. How can we infer the precision of this particular sample result? If we set the interval $\bar{y} \pm s_{\bar{y}}$ around the sample estimate, the statement that the interval covers the true mean will be correct about two-thirds of the time. Similarly, $\bar{y} \pm 2s_{\bar{y}}$ gives a confidence interval for which the statement will be correct 95 percent of the time, and for $\bar{y} \pm 3s_{\bar{y}}$ it will be correct 99.7 percent of the time. To understand the concept, we present the following illustration.

3.4.6 Illustration

Consider again the same population of 12 individuals in Table 3.4.
Let us find the percent of sample averages in Table 3.5 which differ from the population average by less than $S_{\bar{y}}$, less than $2S_{\bar{y}}$, and less than $3S_{\bar{y}}$. (We are using capital S instead of small s, as well as $\bar{Y}$, because we are dealing with a population and we therefore know its true variance and its true mean.) This is the same as finding the percent of sample averages which fall within $\bar{Y} \pm S_{\bar{y}}$, $\bar{Y} \pm 2S_{\bar{y}}$ and $\bar{Y} \pm 3S_{\bar{y}}$, with $\bar{Y}$ = $2,675.

Consider a sample of size 2. With $S_{\bar{y}}$ = $1,015 (Table 3.6), we have:

$$\bar{Y} \pm S_{\bar{y}} = 2{,}675 \pm 1{,}015 = (1{,}660,\ 3{,}690)$$

$$\bar{Y} \pm 2S_{\bar{y}} = 2{,}675 \pm 2{,}030 = (645,\ 4{,}705)$$

Of the 66 possible sample averages in Table 3.5, 42 fall within the interval (1660, 3690). That is, 63.6% of sample averages differ from the population average by less than one sampling error. Similarly, there are 64 averages that fall within the interval (645, 4705); that is, about 97% of sample averages differ from the population average by less than two sampling errors. It can easily be seen that 100% of sample averages differ from the population average by less than three sampling errors.

For the normal distribution, we have seen that the probability of being within one standard (or sampling) error is 68%; for two standard errors, it is 95%; for three standard errors it is 99.7%. This shows that even for samples as small as size 2, the distribution of sample results over all possible samples is already close to the normal distribution. For larger samples, the results would conform to the normal distribution much more closely.

The percentages of sample averages in Table 3.5 which differ from the population average by less than $S_{\bar{y}}$, $2S_{\bar{y}}$ and $3S_{\bar{y}}$ are displayed in Table 3.7.

Table 3.7 Concentration of Sample Results Around the Population Average

Sample of      Sampling error      Percent of sample averages in Table 3.5 differing
size n         S(y-bar)            from the population average by less than
                                   1 S(y-bar)      2 S(y-bar)      3 S(y-bar)
   1            $1,505                 75              92             100
   2             1,015                 64              97             100
   3               786                 65              96             100
   4               642                 64              97             100
   5               537                 65              97             100
   6               454                 64              97             100
   7               383                 65              97             100

NORMAL DISTRIBUTION                    68              95             99.7

Consider the distribution given in Table 3.5 of average income in all possible samples of size 7.
A graph of this distribution is shown in Figure 3.1. This figure appears approximately symmetric, with a clustering of measurements about the midpoint of the distribution, tailing off rapidly as we move away from the center of the histogram.

Figure 3.1 DISTRIBUTION OF AVERAGE INCOME IN ALL POSSIBLE SAMPLES OF SIZE 7

Thus, the graph possesses the following properties:

(1) The sampling distribution of $\bar{y}$ appears approximately normally distributed when the sample size is large.

(2) The average of all possible sample averages equals the population average.

(3) The variance of the sampling distribution is equal to $(1 - f)\,S^2/n$, which is less than the population variance, $S^2$.

Property (1) above is the result of the Central Limit Theorem (CLT), one of the most fundamental and important theorems in statistics. Briefly stated, the CLT shows that if $x_1, x_2, \ldots, x_n$ are independent random variables having the same distribution with mean $\mu$ and variance $\sigma^2$, then for a large enough sample, the variable

$$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$$

has approximately a standard normal distribution (i.e., mean zero and variance one).

3.4.7 Illustration

Unoccupied seats on flights cause the airlines to lose revenue. Suppose a large airline wants to estimate the average number of unoccupied seats per flight over the past year. To accomplish this, the records of 225 flights are randomly selected from the files, and the number of unoccupied seats is noted for each of the sampled flights. The sample mean and standard deviation are $\bar{y}$ = 11.6 seats and s = 4.1 seats. Estimate the mean number of unoccupied seats per flight during the past year, using a 90% confidence interval (ignore the fpc).

The 90% confidence interval is,

$$\bar{y} \pm 1.64\,\frac{s}{\sqrt{n}} = 11.6 \pm 1.64\,\frac{4.1}{\sqrt{225}} = 11.6 \pm 0.45$$

that is, at the 90% confidence level, we estimate the mean number of unoccupied seats per flight to be between 11.15 and 12.05 during the sampled year.

3.4.8 Sampling and Nonsampling Errors

Estimates are subject to both sampling errors and nonsampling errors.
Sampling error arises because information is not collected from the entire target population, but rather from some portion of it. Through the use of scientific sampling procedures, however, it is possible to estimate from the sample data the range within which the true population value (parameter) is likely to be with a known probability. Nonsampling error, on the other hand, is defined as a residual category consisting of all other errors which are not the result of the data having been collected from only a sample. These include errors made by respondents, enumerators, supervisors, office clerical staff, key coding operators, etc.

3.4.9 Total Error (Mean Square Error)

The total error is the sum of all errors about a sample estimate, both sampling and nonsampling, both variable and systematic. An illustration of the composition of the total error follows:

Total Error
    Sampling Error
        Bias
        Variable Error
    Nonsampling Error
        Variable Error
        Bias

In practice, the bulk of sampling error consists of variable error, and by contrast the bulk of nonsampling error is bias. Mathematically, the total error is represented by the mean square error. In terms of expected values, the mean square error of the estimate $\hat{\theta}$ is denoted by $MSE(\hat{\theta})$ and is given by:

$$MSE(\hat{\theta}) = E(\hat{\theta} - \theta)^2$$

which is the average of the squares of deviations of all possible estimates from the parameter. Recall that

$$MSE(\hat{\theta}) = Var(\hat{\theta}) + [Bias(\hat{\theta})]^2$$

If the estimates are unbiased, the mean square error is equivalent to the variance.

Exercises

3.1 Assume that you know the distribution of the number of cows in a population of eight farms, as follows:

Farm              1   2   3   4   5   6   7   8
Number of Cows    4   5   0   3   2   1   1   0

a. Calculate the true average number of cows per farm.
b. Calculate the true standard deviation and variance of the number of cows per farm.
c. Take all possible samples of two farms each and calculate the average number of cows per farm for each sample.
d. Compute the average of the 28 means obtained in c. and compare it with the true mean.
e. Compute the true sampling error for means of samples of 2 farms.
f. Find the proportions of the 28 values of $\bar{y}$ that lie within $\bar{Y} \pm S_{\bar{y}}$ and within $\bar{Y} \pm 2S_{\bar{y}}$. How do they compare with the expected proportions assuming the sampling distribution of $\bar{y}$ is normal?

3.2 Consider the following distribution of N = 6 population values which represent "the number of household persons residing in the housing unit." Random samples of size 2 are drawn from this population.

Housing Unit (HU)      1   2   3   4   5   6
Household Size (HH)    5   6   7   8   9   10

a. Show that the mean of this population is $\bar{Y}$ = 7.5 and its standard deviation (using the N - 1 divisor) is $S = \sqrt{3.5} \approx 1.87$.
b. How many possible random samples of size 2 can be drawn from this population? List them all and calculate their means.
c. Use the results obtained in b. to assign to each possible sample a probability and construct the sampling distribution of the mean for random samples of size 2 from the given population.
d. Calculate the mean and the standard deviation of the probability distribution obtained in c.

3.3 A simple random sample of 100 households will be selected from a village of Nigeria. For this village $\bar{y}$ = 75 Naira per month is spent on electricity and s = 15 Naira. Find a 95% confidence interval for $\bar{Y}$. Interpret the interval (ignore the fpc).

3.4 A manufacturing company wishes to estimate the mean number of hours per month an employee is absent from work. The company decides to randomly sample 320 of its employees from a total of 5,000 employees and monitor their working time for 1 month. At the end of the month the total number of hours absent from work is recorded for each employee. If the mean and standard deviation of the sample are $\bar{y}$ = [...] hours and s = 6.4 hours, find a 95% confidence interval for the true mean number of hours absent per month per employee.

CHAPTER 4 SIMPLE RANDOM SAMPLING BASIC THEORY
________________________________________________________________________________

4.1 SIMPLE RANDOM SAMPLING

The simplest method of probability sampling is simple random sampling (SRS).
To introduce the idea of a simple random sample, let us ask the following questions:

(1) How many distinct samples of size n can be drawn from a population of size N?
(2) How can we define a simple random sample?
(3) How can a random sample be drawn in actual practice?

To answer the first question, we use combinatorics, which allows us to choose n objects out of a total of N in

$$\binom{N}{n} = \frac{N!}{n!\,(N - n)!}$$

ways, where N! = N (N-1) (N-2) ... (3) (2) (1). For instance,

$$\binom{5}{2} = \frac{5!}{2!\,3!} = 10$$

different samples of size n = 2 can be drawn from a population of size N = 5.

To answer the second question, we make use of the answer to the first one and define a simple random sample of size n (or more briefly, a random sample) selected from a population of size N as a sample which is chosen in such a way that each of the $\binom{N}{n}$ possible samples has the same probability of being selected. This probability is equal to:

$$\frac{1}{\binom{N}{n}} = \frac{n!\,(N - n)!}{N!}$$

For example, if a population consists of the N = 5 elements A, B, C, D and E (which might be the incomes of five persons, the number of persons in five households, and so on), there are $\binom{5}{3} = 10$ possible distinct samples of size n = 3; they consist of the elements ABC, ABD, ABE, ACD, ACE, ADE, BCD, BCE, BDE, and CDE. If we choose one of these samples in such a way that each has the probability 1/10 of being chosen, we call this sample a simple random sample.

With regard to the third question of how to take a random sample in actual practice, we could, in simple cases like the one above, write each of the possible samples on a slip of paper, put these slips into a hat, shuffle them thoroughly, and then draw one without looking. Such a procedure is obviously impractical, if not impossible, given the size of most populations; we mention it here only to make the point that the selection of a random sample must depend entirely on chance.

Fortunately, we can take a random sample without actually resorting to the tedious process of listing all possible samples.
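The counting argument and the two ways of drawing a sample can be illustrated with Python's standard library. This is only a sketch for the five-element population A to E used above: `math.comb` gives the number of distinct samples, `itertools.combinations` lists them, and `random.sample` draws elements one at a time without replacement, which gives every listed sample the same chance. The seed value is arbitrary and serves only to make a run repeatable.

```python
import math
import random
from itertools import combinations

population = ["A", "B", "C", "D", "E"]

# Number of distinct samples of size 2 from 5 elements: 5! / (2! 3!).
print(math.comb(5, 2))

# All distinct samples of size 3, in the order listed in the text.
all_samples = ["".join(c) for c in combinations(population, 3)]
print(all_samples)

rng = random.Random(2006)  # arbitrary seed, for repeatability only

# "Slips in a hat": pick one of the listed samples, each with probability 1/10.
print(rng.choice(all_samples))

# Same selection probabilities without listing every sample:
# draw 3 elements one at a time, without replacement.
print(sorted(rng.sample(population, 3)))
```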
We can list instead the N individual elements of a population, and then take a random sample by choosing the elements to be included in the sample one at a time, making sure that in each of the successive drawings each of the remaining elements of the population has the same chance of being selected. The selection may be accomplished by either sampling with replacement or sampling without replacement. In sampling from a finite population, the practice usually is to sample without replacement. Most of the theory which will be discussed is based on this method.

For example, to take a random sample of 12 of a city's 273 drugstores, we could write each store's name (or address, or some other business identification number) on a slip of paper, put the slips of paper into a box or a bag and mix them thoroughly, and then draw (without looking) 12 of the slips one after the other without replacement.

Even this relatively easy procedure can be simplified in actual practice; usually, the simplest way to take a random sample from a population of N units is to refer to a table of random numbers (see Appendix III). In practice, however, the members of the population are often sorted according to certain rules and then a systematic selection of n elements is carried out. The sample thus obtained is, for all practical purposes, treated as a simple random sample.

4.1.1 Procedure for Selecting a Simple Random Sample (Use of Random Number Tables)

A practical procedure for selecting a random sample is to choose units one by one with the help of a table of random numbers. Tables of random numbers are used in practical sampling to avoid the necessity of carrying out some operation such as selecting numbered chips from an urn to designate the units to be included in the sample.
Moreover, experience has shown that it is practically impossible to mix a set of chips thoroughly between each selection, that devices such as cards or dice have imperfections in their manufacture, that in thinking of numbers at random people tend to favor certain digits, etc. Consequently, such methods do not, in fact, give each member of the population an equal chance of selection. The use of a table of random numbers, however, reduces the amount of work involved, and also gives much greater assurance that all elements have the same probability of selection.

Many tables of random numbers are readily available. There are several in the series of Tracts for Computers, notably tables compiled by Tippett, and by Kendall and Smith. The RAND Corporation has published A Million Random Digits. Sets are also available in Statistical Tables by Fisher and Yates, and in other sources. Many of these publications describe the methods of compilation and the uses of the tables. Some microcomputer packages such as LOTUS spreadsheets have a random number generator which can be used to generate pseudo-random numbers, but these generators provide random numbers between 0 and 1.

A table of random numbers is given in Appendix III. Typically, these tables show sets of random digits arranged in groups both horizontally and vertically. To select a set of random numbers, one can start anywhere on a page. Furthermore, after selecting the first number, one can proceed down a column, across a row, up a column, or in any other pattern that is desired.

4.1.2 Illustration

To obtain a random number between 1 and a given number, for example between 1 and 273, proceed as follows: Notice how many digits are in the upper limit number (for 273 there are three digits). Use this number of columns, counting from the first (or a predetermined) column, and start at the top (or on a predetermined line). Each line in the set of three columns has a 3-digit number.
Choose the first of these which is between 001 and the given number, inclusive; that is, between 001 and 273 in our example. Discard numbers which are greater than 273 and discard 000. If more than one random number is desired, continue down the three columns, choosing each 3-digit number which is between 001 and 273 until the desired number of random numbers is obtained. If a number is chosen two or more times, use it only once.

Suppose we have a part of a table of random numbers as follows:

1089    8719
9385    7902
6934    8660
0052    1007
5736    9249
1901    5988
5372    6212

Within the limits of the numbers in the examples which follow, we shall select random numbers from the above table, using a selected number only once.

Example A: Select 3 numbers at random between 1 and 10. First choose an arbitrary column, having decided to let 0 stand for 10. Suppose we choose the fifth column. The first number in the column is 8; the second number is 7; the third is 8 again. Since 8 has already been selected, we skip it and take the next number which is 1. The three numbers selected, therefore, are 8, 7, and 1.

Example B: Select 5 numbers at random between 1 and 80. Suppose we take the first two columns as our choice of a start. First take 10; discard 93 since it is not between 01 and 80; take 69; discard 00 (which represents 100); and take 57, 19, and 53.

4.1.3 Caution in the Use of Random Table

If we use a table of random numbers frequently, we should not always use the same part. For example, if the first random number is always taken from the same column of the same page, the same set of numbers would be used repeatedly, and we would not get proper randomization. If tables of random numbers are used frequently, one can continue from the last random number selected for the previous sample, or a new starting point should be taken for each use.

4.2 NOTATION

The notation defined in this section is appropriate not only for simple random sampling, but also for most designs.
It provides a key to the system used throughout this manual. Capital letters refer to population values and lower case (small) letters denote corresponding sample values. A bar (-) over a letter denotes an average or mean value and (^) over a letter indicates an estimate. We shall use the following notation:

N = total number of units in the population
n = total number of units in the sample
$Y_i$ = value of a characteristic as measured on the i-th unit in the population; i = 1, 2, ... N
$y_i$ = value of a characteristic as measured on the i-th sample unit; i = 1, 2, ... n
$Y = \sum_{i=1}^{N} Y_i$ = total value of a characteristic in the population
$y = \sum_{i=1}^{n} y_i$ = total value of a characteristic in the sample (or sum of sample values)
$\bar{Y} = Y/N$ = population mean
$\bar{y} = y/n$ = sample mean
$\sigma^2 = \sum (Y_i - \bar{Y})^2 / N$ = population variance
$S^2 = \sum (Y_i - \bar{Y})^2 / (N - 1)$ = population variance
$s^2 = \sum (y_i - \bar{y})^2 / (n - 1)$ = sample variance
$f = n/N$ = sampling rate or sampling fraction
$N/n = 1/f$ = sampling weight (expansion factor)
CV = coefficient of variation
cv = estimated coefficient of variation

As we mentioned earlier, we shall use, unless otherwise mentioned, $S^2$ for the population variance. The difference between $S^2$ and $\sigma^2$ disappears for large populations. In general, the population variance, $S^2$, is not known. The sample variance, $s^2$, will be used as its estimate; this will hold throughout the course regardless of the sampling scheme being discussed. It should be noted that in simple random sampling, $s^2$ is an unbiased estimate of $S^2$.

4.2.1 Population Values, Their Respective Estimates, and Measures of Precision

The sample estimate of the population total value, Y, is denoted by $\hat{Y}$ and can be written as:

$$\hat{Y} = N\bar{y} \qquad (4.1)$$

where $\bar{y}$ is the estimate of the population average, $\bar{Y}$, and is given by:

$$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i \qquad (4.2)$$

The sampling error of the estimate of Y is:

$$S_{\hat{Y}} = N\sqrt{1 - f}\,\frac{S}{\sqrt{n}} \qquad (4.3)$$

and the sampling error for $\bar{y}$ is:

$$S_{\bar{y}} = \sqrt{1 - f}\,\frac{S}{\sqrt{n}} \qquad (4.4)$$

The corresponding formulas for the estimated sampling error are:

$$s_{\hat{Y}} = N\sqrt{1 - f}\,\frac{s}{\sqrt{n}} \qquad (4.5)$$

$$s_{\bar{y}} = \sqrt{1 - f}\,\frac{s}{\sqrt{n}} \qquad (4.6)$$

4.2.2 Illustration

Let us verify equation (4.3) with the data for the 12 individuals discussed previously (see Chapter 3).
We have already used equation (4.4) for the means of samples of sizes 1 and 2 in Illustration 3.4.3, and the standard errors for different sample sizes were given in Table 3.6 of Chapter 3. Using this table, the sampling error of the estimated total income of the 12 individuals can be obtained. Equation (4.3) can be expressed as:

$$S_{\hat{Y}} = N\,S_{\bar{y}}$$

Using Table 3.6 of Chapter 3, the sampling error of the estimated total income for samples of size 2 is:

$$S_{\hat{Y}} = 12 \times 1{,}015 = 12{,}180$$

4.2.3 Relative Error

Often we wish to consider not the absolute value of the standard error, but its value in relation to the magnitude of the statistic (mean, total, etc.) being estimated. For this purpose, one can express the standard error as a proportion (or a percent) of the value being estimated. This form is called the relative standard error, or coefficient of variation, and is denoted by the symbol CV.

The true population CV (for a given characteristic or variable) is defined as follows:

$$CV = \frac{S}{\bar{Y}}$$

The sample cv (for a given characteristic or variable) is given by:

$$cv = \frac{s}{\bar{y}}$$

The true CV of an estimate is denoted by:

$$CV(\hat{\theta}) = \frac{S_{\hat{\theta}}}{\theta}$$

where $\hat{\theta}$ represents any estimate (mean, total, proportion, ratio). To estimate the true $CV(\hat{\theta})$ we use the following formula, which uses data from a sample:

$$cv(\hat{\theta}) = \frac{s_{\hat{\theta}}}{\hat{\theta}}$$

One advantage of expressing error as a coefficient of variation is that it is unitless, unlike absolute measures such as the standard deviation and the sampling error. The CV is useful when making comparisons because no units enter into play.

The population CV refers to the relative sampling error of means of samples of 1 unit (that is, the population standard deviation expressed as a proportion of the population mean) and it is denoted simply by CV (not followed by a parenthesis). Thus, for the estimate of the total, the true coefficient of variation is:

$$CV(\hat{Y}) = \frac{S_{\hat{Y}}}{Y} = \frac{N\sqrt{1 - f}\,S/\sqrt{n}}{N\bar{Y}} = \sqrt{1 - f}\,\frac{CV}{\sqrt{n}} \qquad (4.7)$$

Similarly, for the estimate of the mean, the coefficient of variation is:

$$CV(\bar{y}) = \frac{S_{\bar{y}}}{\bar{Y}} = \sqrt{1 - f}\,\frac{CV}{\sqrt{n}} \qquad (4.8)$$

That is, equation (4.7) is equal to equation (4.8).
Therefore,

$$CV(\hat{Y}) = CV(\bar{y})$$

The corresponding formulas for the estimated (obtained from a sample) coefficients of variation are:

$$cv(\hat{Y}) = \sqrt{1 - f}\,\frac{cv}{\sqrt{n}} \qquad (4.9)$$

$$cv(\bar{y}) = \sqrt{1 - f}\,\frac{cv}{\sqrt{n}} \qquad (4.10)$$

The standard error of the estimated total is N times that of the mean, while the coefficients of variation of the two estimates are the same; this result is, upon reflection, not unexpected. An estimated total is obtained by multiplying the sample mean (an estimate) by the number of elements in the population (a known number); the only source of error is the sample mean. Therefore, we should expect that, when expressed as a proportion or percentage, the error in the total would be the same as that in the mean; however, when the error in the total is expressed in absolute terms, it would be N times as large as the error in the mean, since N is the factor of multiplication.

The big advantage of the coefficient of variation is that it permits comparison of two distributions of values even though they may be totally unrelated. For example, one could compare the variability of the length of mice tails with that of the weight of elephants. This is possible because variability is expressed relative to the mean, that is, it is the average variability per unit of mean.

Another way to look at the coefficient of variation is to consider it as a measure of dispersion for relative deviations. Recall that the variance of $Y_i$ was given by

$$\sigma^2 = \frac{\sum_{i=1}^{N} (Y_i - \bar{Y})^2}{N}$$

This is a measure of dispersion of the absolute deviations $(Y_i - \bar{Y})$. If we now consider the relative deviations $(Y_i - \bar{Y})/\bar{Y}$, square them, add them, and then average them over N, we get the following expression:

$$\frac{1}{N}\sum_{i=1}^{N}\left(\frac{Y_i - \bar{Y}}{\bar{Y}}\right)^2$$

which is called the relative variance of the distribution or simply the relvariance. If we rearrange terms in the above expression, we get:

$$\frac{1}{\bar{Y}^2}\cdot\frac{\sum_{i=1}^{N} (Y_i - \bar{Y})^2}{N} = \frac{\sigma^2}{\bar{Y}^2}$$

The square root of this last expression is the population coefficient of variation mentioned before (with $\sigma$ in place of S, a difference that disappears for large populations).
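The equality of the two coefficients of variation can be checked numerically. The Python sketch below uses the 12 incomes of Table 3.4 and a sample size of 4, computes the sampling errors of the mean and of the total from equations (4.3) and (4.4), and compares the resulting relative errors with the right-hand side of (4.7) and (4.8). The helper names are illustrative, and the choice n = 4 is arbitrary.

```python
import math
from statistics import mean, pvariance, stdev

# Incomes of the 12 persons in Table 3.4.
INCOMES = [1300, 6300, 3100, 2000, 3600, 2200,
           1800, 2700, 1500, 900, 4800, 1900]

N = len(INCOMES)
Y_bar = mean(INCOMES)          # population mean
S = stdev(INCOMES)             # population standard deviation, divisor N - 1
CV = S / Y_bar                 # population coefficient of variation


def se_mean(n):
    """Equation (4.4): sampling error of the sample mean."""
    return math.sqrt((N - n) / N) * S / math.sqrt(n)


def se_total(n):
    """Equation (4.3): sampling error of the estimated total."""
    return N * se_mean(n)


n = 4  # arbitrary sample size for the check
cv_mean = se_mean(n) / Y_bar
cv_total = se_total(n) / (N * Y_bar)
cv_formula = math.sqrt((N - n) / N) * CV / math.sqrt(n)
print(round(cv_mean, 4), round(cv_total, 4), round(cv_formula, 4))

# Relvariance as defined in the text (average over N, so it uses sigma^2).
relvariance = pvariance(INCOMES) / Y_bar ** 2
print(round(math.sqrt(relvariance), 4), round(CV, 4))
```

The last line prints the square root of the relvariance next to CV; the two differ only by the factor that separates $\sigma$ from S, which is noticeable here because N is just 12.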
4.3 SAMPLING FOR PROPORTIONS

An important class of statistics for which the formulas for variance and the formulas for determining the size of sample become particularly simple is the estimation of the proportion of units having a certain characteristic.

4.3.1 Types of Statistics for Which Proportions are Used

Proportions arise in two ways in statistical analysis. First of all, we are frequently interested in a statistic that is a proportion, rather than a total or an average; for example, the proportion of the population that is unemployed, or the percentage of families with income greater than a certain amount, or the proportion of business firms interested in purchasing a particular product.

Secondly, it may be desired to classify a population into a number of groups, and to find the percentage of the total population in each of these groups. The groups may have a natural ordering as in distribution by age (0 to 4 years, 5 to 9, 10 to 14, etc.) or income classes; or they may be groups having no natural order, such as those in an industrial classification of business firms, where the groups can be arranged in a number of ways. The analysis is the same whenever the proportion of the total in each group is the statistic to be measured.

4.3.2 Relationship to Previous Theory

Suppose we think of the total population and the sample in the following way. Consider a particular class of units in which we are interested, and use the following notation:

A = Total number of units in that class in the population
a = Number of units in that class in the sample
P = True proportion of units in that class in the population
p = Proportion in that class in the sample
Q = Population proportion not in that class (Q = 1 - P)
q = Proportion not in that class in the sample (q = 1 - p).
Note that

$$P = \frac{A}{N} \quad\text{and}\quad p = \frac{a}{n}$$

All of the formulas discussed in previous lectures can be applied to this particular case by considering each member of the population as having a characteristic which can have only one of two values, either 0 or 1. If the member is in a particular class in which we are interested, the value assigned is 1; if the member is not in the class, the value is 0.

Examining the entire population, we can see that the A members of the class each have a value of 1; the rest have a value of 0. Adding up the values for all elements of the population, we get A. In other words, A can be considered as the equivalent of the total Y that we have already discussed. Similarly, P = A/N can be considered in the same way as the mean $\bar{Y}$. We can now use the previous formulas. It turns out that they are particularly easy to use in this case.

4.3.3 Applicable Formulas

In sampling for proportions, the following formulas are applicable (with simple random sampling):

$$\hat{P} = p = \frac{a}{n} \quad\text{and}\quad \hat{A} = Np \qquad (4.11)$$

That is, an estimate of the proportion in the population is obtained by using the sample proportion, and an estimate of the total number of units having the characteristic is obtained by multiplying the sample proportion by the total number of units in the population. Also

$$\sigma^2 = PQ \qquad (4.12)$$

The population variance is PQ. Note that it is the variance of the population distribution giving the value of 1 or 0 to an element depending on whether or not it is in the class (whether it has the attribute in question). It can be estimated by pq, unless n is very small (for example n < 30), in which case the formula is

$$s^2 = \frac{n\,pq}{n - 1}$$

The variance of the estimate of the proportion which is computed from all samples of size n is

$$S_p^2 = (1 - f)\,\frac{PQ}{n} \qquad (4.13)$$

The estimate of this variance which is made from a single sample of n observations is

$$s_p^2 = (1 - f)\,\frac{pq}{n} \qquad (4.13a)$$

See equations (4.4) and (4.6). These are the same formulas given previously, with PQ substituted for $S^2$, and with pq substituted for $s^2$.
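As a small numerical sketch of equations (4.11) and (4.13), the Python code below computes the sampling error of a proportion and the estimated class total. The function names are illustrative rather than part of the manual. The figures (P = 0.40, n = 500, N = 10,000) are those of the maize illustration in Section 4.3.4, where the sampling error is given as 0.021.

```python
import math


def proportion_sampling_error(P, n, N=None):
    """sqrt((1 - f) * P * Q / n) with f = n / N; the fpc is omitted if N is None."""
    Q = 1.0 - P
    fpc = 1.0 if N is None else (N - n) / N
    return math.sqrt(fpc * P * Q / n)


def estimated_total(p, N):
    """Equation (4.11): estimated number of units in the class, N * p."""
    return N * p


se = proportion_sampling_error(0.40, 500, N=10_000)
print(round(se, 3))                    # sampling error of the proportion
print(estimated_total(0.40, 10_000))   # estimated number of farms in the class
```

Passing a sample proportion p in place of P gives the estimated sampling error of equation (4.13a) under the same formula.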
Similarly, the formulas given in the previous section for the relative standard error (coefficient of variation) of a mean and the sampling error of an estimated total become, with $\sigma_p$ denoting the standard error of the sample proportion p:

$$CV_p = \frac{\sigma_p}{P} = \sqrt{\frac{N-n}{N}\cdot\frac{Q}{nP}} \tag{4.14}$$

and,

$$\sigma_{\hat{A}} = N\sigma_p = N\sqrt{\frac{N-n}{N}\cdot\frac{PQ}{n}} \tag{4.15}$$

Again the relative standard error of the total is the same as that of the mean.

The confidence interval for the proportion is derived on the same assumptions as for the quantitative characteristics, namely that the sample proportion p is normally distributed. From (4.13a) for the estimated variance of p, one form of the normal approximation to the confidence interval for P is:

$$p \pm t\sqrt{\frac{N-n}{N}\cdot\frac{pq}{n}} \tag{4.16}$$

where the value t depends on the level of confidence desired (see Section 4.4 of Chapter 3).

4.3.4 Illustration

Estimate of sampling error. Suppose that the proportion of farms that grow maize in a given area is 0.40; what would be the sampling error in estimating this proportion from a random sample of 500 farms, if the total number of farms in the area is 10,000? In this case,

N = 10,000    P = 0.40
n = 500       Q = 0.60

We have

$$\sigma_p^2 = \frac{N-n}{N}\cdot\frac{PQ}{n} = \frac{9{,}500}{10{,}000}\cdot\frac{(0.40)(0.60)}{500} = 0.000456$$

Consequently,

$$\sigma_p = \sqrt{0.000456} = 0.021$$

How is the figure of 0.021 to be interpreted? It means that if we establish an interval of $0.40 \pm 0.021$ around the true proportion (or 0.379 to 0.421), there is a reasonably good chance (68 percent) that a sample of 500 farms will give a proportion somewhere between 0.379 and 0.421. If we double the interval to get a range of 0.358 to 0.442, the chance is about 95 percent that the sample estimate will be within that range. If an interval based on three times 0.021 (or 0.063) is used, giving a range of 0.337 to 0.463, the chance is 0.997 (or nearly certain) that the sample estimate will be within that range.

In normal practice, it is customary to use a 2-S range (two standard/sampling errors) as providing sufficient confidence in the accuracy of the estimates. If very important decisions are to be based on the results of the survey, and we wish to be almost absolutely sure of the range within which the sample estimate will lie, we can use a 3-S level.
It is difficult to conceive of cases in which 3-S would not be sufficient.

In this example, both the proportion (0.40) and the chance of the sample estimate being within a certain range around this proportion were known. In practice, we are usually interested in the converse of this situation, in which we do not know the true proportion but we do have a sample estimate of 0.40 based on a sample of 500 farms out of 10,000. We wish to establish ranges around the sample figure which will be expected to include the true proportion. For all practical purposes, the same statements can be made as before by substituting the term "true figure" for "sample estimate." That is, if the sample shows that 0.40 of the farms grow maize and we establish a range of $0.40 \pm 0.021$ (0.379 to 0.421), the chances are about 68 percent that this range will include the true figure; the chances are about 95 percent that the interval 0.358 to 0.442 will include the true figure; etc.

4.3.5 Procedure When P Refers to a Subset of a Class

Frequently, the proportion to be estimated is a percentage, not of the total population, but of a particular class. For example, we may be interested not in unemployment expressed as a percentage of the total population, but as a percentage of persons in the labor force; or we may need to know the proportion of firms with more than 5 employees in a particular industry. In such cases, a very close approximation to an exact analysis can be made by using the formulas listed above, but interpreting the numbers N and n as applying to the class in which we are interested. That is, N would not be the total population but the number of persons in this class (for example, the total number of persons in the labor force) as estimated from the sample; n would be the number of sample cases in this class; and a would be the number of sample cases in the subset (for example, the number of unemployed).

4.3.6 Tabled Values of $\sqrt{PQ/n}$

Table 4.1 shows the value of $\sqrt{PQ/n}$ for specified values of P and n.
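Entries of the kind tabulated in Table 4.1 can be recomputed directly. The short sketch below is an illustration only: the function name `srs_standard_error` and the choice of which P and n values to print are assumptions, and the calculation ignores the finite population correction, as the table does.

```python
import math


def srs_standard_error(P, n):
    """Standard error sqrt(PQ/n) of a proportion, ignoring the fpc."""
    return math.sqrt(P * (1.0 - P) / n)


# Recompute a few cells of the kind shown in Table 4.1 (selection is arbitrary).
for n in (50, 100, 500, 1000):
    row = [round(srs_standard_error(P, n), 4) for P in (0.10, 0.25, 0.50)]
    print(n, row)
```

Because PQ is unchanged when P and Q are swapped, a single calculation serves both P and 1 - P, which is why each table column carries two headings.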
As described in sections 4.3.7 and 4.3.8 below, we can use the simplified formula

$$\sigma_p = \sqrt{\frac{PQ}{n}} \tag{4.17}$$

to compute the standard error of the proportion of units having a certain attribute, if the sample is an unrestricted (simple) random sample and if N is so large relative to n that the factor (N-n)/N in the formula has a value very close to 1. Since the true proportion in the population (P) is not known, the estimate from the sample (p) may be used in equation (4.17) to give an estimate of the sampling error of p:

$$s_p = \sqrt{\frac{pq}{n}} \tag{4.18}$$

Most samples are stratified; that is, they are not simple random samples. We shall see later that this has the effect of making the sampling error smaller than it would be for a simple random sample of the same size. However, most samples used in surveys are also clustered, and we shall also see that this has the opposite effect of making the sampling error larger than it would be for a simple random sample of the same size. When the sample is both stratified and clustered, the formulas for the standard error become more complex.

Sometimes it is not possible to work out the exact formulas, but a rough estimate of the standard error can be obtained by using the simple formula of equation (4.17) with an allowance for the expected net effect of departures from randomness in the sample design. If the units of analysis are clustered into rather small groups (for example, 5 housing units or 25 persons in a cluster) and the persons within a cluster are rather similar, as in a cluster located in a rural area, the standard error of a proportion as read from Table 4.1 might be multiplied by a factor such as 1.25. This factor reflects the design effect; because it multiplies the standard error rather than the variance, it corresponds to the square root of the DEFF defined in section 4.3.7. In a larger cluster, such as a city block with 40 or 50 housing units, the factor to be applied to Table 4.1 might be 1.75, even though the persons within the cluster are less alike in an urban area than in a rural area.
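The rough adjustment described above can be sketched as follows. This is a hedged illustration rather than a procedure from the manual: the function names and the values P = 0.20 and n = 1,000 are hypothetical, while the factors 1.25 and 1.75 are the ones quoted in the text. The second function adds one derived quantity, the simple random sample size that would give the same standard error, which follows algebraically from multiplying sqrt(PQ/n) by the factor.

```python
import math


def adjusted_standard_error(P, n, factor):
    """Multiply the simple-random-sampling standard error by a design factor."""
    srs_se = math.sqrt(P * (1.0 - P) / n)
    return factor * srs_se


def equivalent_srs_size(n, factor):
    """Simple random sample size that would give the same standard error."""
    return n / factor ** 2


P, n = 0.20, 1000                      # illustrative values, not from the manual
for factor in (1.0, 1.25, 1.75):       # 1.25 and 1.75 are the factors quoted in the text
    se = adjusted_standard_error(P, n, factor)
    print(factor, round(se, 4), round(equivalent_srs_size(n, factor)))
```

Under these assumptions, a clustered sample of 1,000 with a factor of 1.25 is about as precise as a simple random sample of 640.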
The size of the design effect to be used depends on the sample design and the nature of the population; it can sometimes be roughly estimated by an experienced sampling statistician, using past experience and mathematical formulas involving the "intraclass correlation."

Table 4.1 SAMPLING ERROR OF AN ESTIMATE OF A PROPORTION IN SIMPLE RANDOM SAMPLING
($\sigma_p = \sqrt{PQ/n}$ for specified values of P and n)

P = Proportion of units having a characteristic (Q = 1-P has the same standard error)
n = Number of sample cases

P:       .001   .002   .01    .02    .03    .04    .05    .10    .15    .20    .25    .30    .40    .50
or 1-P:  .999   .998   .99    .98    .97    .96    .95    .90    .85    .80    .75    .70    .60

n
  50     .0045  .0063  .0141  .0198  .024   .028   .031   .042   .051   .057   .061   .065   .069   .071
 100     .0032  .0045  .0099  .0140  .017   .020   .022   .030   .036   .040   .043   .046   .049   .050
 200     .0022  .0032  .0071  .0099  .012   .014   .016   .021   .025   .028   .031   .033   .035   .035
 300     .0018  .0026  .0058  .0081  .0099  .012   .013   .017   .021   .023   .025   .027   .028   .029
 400     .0016  .0023  .0050  .0070  .0086  .010   .011   .015   .018   .020   .022   .023   .024   .025
 500     .0014  .0020  .0045  .0063  .0076  .0089  .0098  .013   .016   .018   .019   .021   .022   .022
 600     .0013  .0018  .0041  .0057  .0070  .0082  .0090  .012   .015   .016   .018   .019   .020   .020
 700     .0012  .0017  .0038  .0053  .0065  .0076  .0083  .011   .014   .015   .016   .017   .019   .019
 800     .0011  .0016  .0035  .0050  .0061  .0071  .0078  .011   .013   .014   .015   .016   .017   .018
1000     .0010  .0014  .0032  .0044  .0054  .0063  .0070  .0095  .011   .013   .014   .015   .015   .016
1200     .0009  .0013  .0029  .0040  .0049  .0058  .0064  .0087  .010   .012   .013   .013   .014   .014
1500     .0008  .0012  .0026  .0036  .0044  .0052  .0057  .0077  .0093  .010   .011   .012   .013   .013
1700     .0008  .0011  .0024  .0034  .0042  .0049  .0053  .0073  .0087  .0097  .011   .011   .012   .012
2000     .0007  .0010  .0022  .0031  .0038  .0045  .0049  .0067  .0081  .0090  .0097  .010   .011   .011
2500     .0006  .0009  .0020  .0028  .0034  .0040  .0044  .0060  .0072  .0080  .0087  .0092  .0098  .0100
3000     .0006  .0008  .0018  .0026  .0031  .0039  .0040  .0055  .0066  .0073  .0079  .0084  .0090  .0092
3500     .0005  .0008  .0017  .0024  .0029  .0034  .0037  .0051  .0061  .0068  .0073  .0078  .0083  .0084
4000     .0005  .0007  .0016  .0022  .0027  .0032  .0035  .0047  .0057  .0063  .0068  .0073  .0077  .0079
4500     .0005  .0006  .0015  .0021  .0025  .0030  .0033  .0045  .0054  .0060  .0065  .0069  .0073  .0074
5000     .0004  .0006  .0014  .0020  .0024  .0028  .0031  .0042  .0051  .0057  .0061  .0065  .0069  .0071

In practice the sample value p would be used, inasmuch as the population value P would not be known. For values of n greater than 5,000, when n is multiplied by 100, the standard error is divided by 10.

4.3.7 The Design Effect (DEFF)

The design effect or DEFF is the ratio of the variance of the estimate obtained from the more complex sample (described later in this text) to the variance of the estimate obtained from a simple random sample of the same size. For instance, if $\sigma_c^2$ is the variance of the estimate, say $\bar{x}$, obtained from a complex sample, and $\sigma_{srs}^2$ is the variance of the same estimate based on simple random sampling, then

$$DEFF = \frac{\sigma_c^2}{\sigma_{srs}^2} \quad \text{and} \quad \sigma_c^2 = DEFF \cdot \sigma_{srs}^2$$

where $\sigma_c^2$ = variance obtained from the more complex design.

This approach is commonly used by practical samplers. In many situations where we cannot directly estimate the variance of the estimate, we may be able to guess fairly well both the element variance S² and DEFF from experience with similar past data. This comprehensive factor attempts to summarize the effects of various complexities in the sample design, especially those of clustering.

4.3.8 Finite Correction Factor (or Finite Population Correction Factor)

The exact formula for the relative variance (square of the coefficient of variation) of a mean for a simple random sample,

$$CV_{\bar{x}}^2 = \frac{N-n}{N}\cdot\frac{S^2}{n\bar{X}^2}$$

where $\bar{X}$ is the population mean, can be divided into two parts: $\frac{N-n}{N}$ and $\frac{S^2}{n\bar{X}^2}$. The only way the size of the total population comes into the formula is in the expression $\frac{N-n}{N}$. This is usually called the finite population correction factor (fpc). If the population were infinite this factor would be 1 and the formulas would be much simpler:

$$CV_{\bar{x}}^2 = \frac{S^2}{n\bar{X}^2}$$

The value of $\frac{N-n}{N}$ is equal to $1 - f$, where $f = n/N$ is the sampling rate.
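To see how much the finite population correction matters at different sampling rates, the following sketch prints the factor (N - n)/N and its square root, which is the multiplier applied to the standard error. The function name `fpc` and the population size of 10,000 are assumptions made for illustration only.

```python
import math


def fpc(n, N):
    """Finite population correction (N - n) / N, i.e. 1 minus the sampling rate."""
    return (N - n) / N


N = 10_000                                   # illustrative population size
for n in (100, 500, 2_000, 5_000):
    rate = n / N
    shrink = math.sqrt(fpc(n, N))            # multiplier applied to the standard error
    print(f"rate={rate:.2f}  fpc={fpc(n, N):.3f}  se multiplier={shrink:.3f}")
```

At a sampling rate of 0.05 the standard error is reduced by only about 2.5 percent, whereas at a rate of 0.50 it is reduced by nearly 30 percent.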
If the sampling rate is small, say less than 0.05, the effect of the finite population correction factor is very small and, for all practical purposes, the finite population correction factor can be ignored.

4.3.9 Simplification for Large Populations

With large populations and small sampling rates, the fpc can be ignored and the formulas become simpler. In the table below, $\bar{X}$ and $\bar{x}$ denote the population mean and the sample mean.

Simplified Formulae

                                                                        True Value                  Estimate
Variance of the mean                                                    $S^2/n$                     $s^2/n$
Variance of a proportion                                                $PQ/n$                      $pq/n$
Coefficient of variation of the mean                                    $S/(\bar{X}\sqrt{n})$       $s/(\bar{x}\sqrt{n})$
Coefficient of variation of a proportion                                $\sqrt{Q/(nP)}$             $\sqrt{q/(np)}$
Variance of a total                                                     $N^2S^2/n$                  $N^2s^2/n$
Variance of the total number of units having an attribute              $N^2PQ/n$                   $N^2pq/n$
Coefficient of variation of a total                                     $S/(\bar{X}\sqrt{n})$       $s/(\bar{x}\sqrt{n})$
Coefficient of variation of total number of units having an attribute  $\sqrt{Q/(nP)}$             $\sqrt{q/(np)}$

CHAPTER 5 SIMPLE RANDOM SAMPLING: ESTIMATION OF SAMPLE SIZE
_______________________________________________________________________

5.1 SPECIFIC CONSIDERATIONS FOR DETERMINING THE SAMPLE SIZE

One of the first questions which a statistician is called upon to answer in planning a sample survey concerns the size of the sample required for estimating a population parameter with a specified precision. Making a decision about the size of the sample for the survey is important. Too large a sample implies a waste of resources, and too small a sample diminishes the utility of the results.

When considering sample size determination, there are three very important concerns: ACCURACY, PRACTICALITY, and EFFICIENCY.

5.1.1. Accuracy

Accuracy can be defined as an inverse measure of the total error. Total error is the sum of sampling error (SE) and nonsampling error (NSE). Sampling error arises because only a part of the population is observed, and not all of it. The terms PRECISION and RELIABILITY are associated with sampling error. Estimator A is more precise or more reliable than estimator B if the sampling error of A is smaller than the sampling error of B.
Nonsampling errors are usually biases, very often due to poor quality control of the survey operations (poor questionnaire design, poorly trained interviewers, response errors, etc.).

5.1.2. Practicality

To obtain an accurate estimate, both sampling and nonsampling errors must be reduced. However, accuracy may come into conflict with practicality because:

a. to reduce sampling errors and increase precision, the sample size must be large.
b. too large a sample can impose an excessive burden on the limited resources available (and resources are usually very limited) and increase the likelihood of nonsampling errors.

5.1.3. Efficiency

A further concern is that a given sample size can produce different levels of precision depending on which sampling techniques are chosen. This concept is known as the statistical efficiency of the design. The most efficient design is the one that gives the most precision for the same sample size. Therefore, expert sample design is needed in the determination of the optimal sample size.

Example 5.1

A population consists of N = 5000 persons. A simple random sample without replacement (SRSWOR) of size n = 50 included 10 persons of Chinese descent, so that p = 10/50 = 0.20. A 95% confidence interval for P, the proportion of persons of Chinese descent in the population, is:

$$p \pm 1.96\sqrt{\frac{pq}{n}} = 0.20 \pm 1.96\sqrt{\frac{(0.20)(0.80)}{50}} = 0.20 \pm 0.111$$

(The fpc is ignored here because the sampling rate n/N = 0.01 is small.)

The conclusion is that between 8.9% and 31.1% of the population is of Chinese descent. This interval is too wide to be useful. There are two ways in which a narrower interval could be obtained:

- by lowering the confidence level, or
- by increasing the sample size.

There is a point at which lowering the confidence level is not attractive. We shall therefore consider the problem of determining the sample size necessary to produce a fixed level of precision.

The following eight steps are taken into account when determining the sample size. We will study each one in detail.
(1) Degree of precision desired.
(2) Formula to connect n with desired precision.
(3) Advance estimates of variability in population.
(4) Cost and operational constraints.
(5) Expected sample loss due to nonresponse.
(6) Number of different characteristics for which specified precision is required.
(7) Population subdivisions for which separate estimates of a given precision are required. These are also called domains of estimation.
(8) Expected gain or loss in efficiency.

5.2 Degree of Precision

The precision of an estimate refers to the amount of variable error, mainly sampling error, contained in an estimate. To lower the sampling error, that is, to increase the precision, we want n to be sufficiently large. Therefore, we decide on a target value for the precision of the estimate.

The degree of precision desired can be stated in terms of:

(1) The absolute error (E) for the estimate

$$P\left(|\hat{\theta} - \theta| \le E\right) = 1 - \alpha$$

where $\hat{\theta}$ is an estimate of the parameter $\theta$ and $(1 - \alpha)$ is the degree of confidence desired. The absolute error E is measured in the same unit used to measure the variable. For example, E = 5 hectares or E = $10,000 or E = 25 persons.

(2) The relative error (RE) for the estimate

$$P\left(\frac{|\hat{\theta} - \theta|}{\theta} \le RE\right) = 1 - \alpha$$

This is E expressed as a proportion (or percentage) of the true value of the parameter being estimated. For example, if E = 5 hectares and the true value of the parameter is 100 hectares, then RE = 5/100 = 0.05 or 5%.

(3) The target coefficient of variation (cv) for the estimate (v0)

We set the cv (also known as the relative standard error) for the estimate equal to a target value v0. For example, we can require that the cv of the estimate be no larger than v0: $cv(\hat{\theta}) \le v_0$.

Depending on which of the three ways we use to specify the precision, the formula for n will be different. The values of E, RE and $\alpha$ are usually decided by the user of the data in conjunction with the statistician.

5.3 Formula that Connects n (sample size) with the Desired Degree of Precision

The following terms are used in the formulas outlined below.
S² = the population variance
n = the desired sample size
CV = the population coefficient of variation, $CV = S/\bar{X}$, where $\bar{X}$ is the population mean
N = number of units in the population
k = 1 for 68% confidence, 2 for 95% confidence, 3 for 99.7% confidence (exact normal deviates, such as 1.96 for 95% confidence, could be used instead)
cv = the coefficient of variation of the estimator
v0 = specified target value for the estimate's cv
E = absolute error
RE = relative error

Note: the level of confidence states the probability that the n determined will provide the degree of precision specified. For example, a 95% level of confidence means that, except for a small chance (5%), the precision specified will be reached with the calculated n. This is equivalent to saying that the acceptable risk is 5% that the true $\theta$ will lie outside of the range specified in the confidence interval.

5.3.1. Sample size needed to estimate a mean with absolute error E

The sampling error of a mean using simple random sampling is given by:

$$\sigma_{\bar{x}} = \sqrt{\frac{N-n}{N}\cdot\frac{S^2}{n}}$$

Now,

$$E = k\,\sigma_{\bar{x}}$$

where k is a multiple of the sampling error, selected to achieve the specified degree of confidence. Therefore, if we substitute (E/k) for $\sigma_{\bar{x}}$, we get:

$$\frac{E}{k} = \sqrt{\frac{N-n}{N}\cdot\frac{S^2}{n}} \tag{5.1}$$

If we solve for n in (5.1) above, we get:

$$n = \frac{N k^2 S^2}{N E^2 + k^2 S^2} \tag{5.2}$$

If the population size is large and n ≤ 0.05N, the finite population correction factor in equation (5.1) can be ignored because its effect would be minimal. In this case, we have:

$$n = \frac{k^2 S^2}{E^2} \tag{5.3}$$

Example 5.2

Consider a population consisting of 1,000 farms for which the population variance of the number of cattle per farm is 250 (N = 1,000 and S² = 250). Let us estimate the average number of cattle per farm from a sample; we wish to have reasonable confidence that the estimate will be close to the true value. Suppose the sample estimate is to be in error by no more than 1 (one head of cattle) from the true average, and we require an assurance of 95 chances out of 100 that the error will be no larger than 1.
In this case,

E = 1; E² = 1
N = 1,000
S² = 250
k = 2 (since 2 gives us almost a 95% confidence level); k² = 4.

Applying equation (5.2), we see that n must be equal to or greater than

$$n = \frac{(1{,}000)(4)(250)}{(1{,}000)(1) + (4)(250)} = \frac{1{,}000{,}000}{2{,}000} = 500$$

If in the same situation we are satisfied with an error of not more than 3, with a confidence level of 95 percent, the only change in the formula would be in the values of E and E², as follows: E = 3 and E² = 9. Then we would have

$$n = \frac{(1{,}000)(4)(250)}{(1{,}000)(9) + (4)(250)} = \frac{1{,}000{,}000}{10{,}000} = 100$$

and a sample of 100 cases would be sufficient.

Example 5.3

We wish to estimate the average age of 2,000 seniors on a particular college campus. How large a SRS must be taken if we wish to estimate the age within 2 years of the true average, with 95% confidence? Assume S² = 30.

E = 2 and k = 2, so that equation (5.2) gives

$$n = \frac{(2{,}000)(4)(30)}{(2{,}000)(4) + (4)(30)} = \frac{240{,}000}{8{,}120} = 29.6$$

and a sample of 30 seniors would be sufficient.

5.3.2. Sample size needed to estimate a proportion with absolute error E

The sample size n to estimate a population proportion P is obtained from equation (5.2); in this equation, $S^2 = \frac{N}{N-1}PQ$, but we'll use the approximation S² = PQ (i.e., we'll assume N is big enough so that N/(N-1) is very close to 1):

$$n = \frac{N k^2 PQ}{N E^2 + k^2 PQ} \tag{5.4}$$

And for a large population size (n ≤ 0.05N), we have from equation (5.3),

$$n = \frac{k^2 PQ}{E^2} \tag{5.5}$$

Example 5.4

Refer to Example 5.1. Suppose we would like to estimate P, the proportion of persons of Chinese descent, to within ± 3%, with 95% confidence. What sample size do we have to choose to achieve this target? Assume P to be no larger than 1/2.

Using equation (5.4) with N = 5,000, E = 0.03, k = 2 and PQ = (0.5)(0.5) = 0.25:

$$n = \frac{(5{,}000)(4)(0.25)}{(5{,}000)(0.0009) + (4)(0.25)} = \frac{5{,}000}{5.5} = 909.1$$

so a sample of 910 persons is required.

Now, let's assume that we know that P ≤ 0.25. What is the required sample size? With PQ = (0.25)(0.75) = 0.1875:

$$n = \frac{(5{,}000)(4)(0.1875)}{(5{,}000)(0.0009) + (4)(0.1875)} = \frac{3{,}750}{5.25} = 714.3$$

so a sample of 715 persons is required.

5.3.3. Sample size needed to estimate a total with absolute error E

Using equation (4.3), and letting E equal k times the sampling error of the estimated total, we get the following formula for n:

$$n = \frac{N^2 k^2 S^2}{E^2 + N k^2 S^2} \tag{5.6}$$

If we ignore the fpc, we have:

$$n = \frac{N^2 k^2 S^2}{E^2} \tag{5.7}$$

5.3.4. Sample size needed to estimate the number of units that possess a certain attribute with absolute error E

To obtain the n necessary to estimate A, the number of units that possess a certain characteristic, simply substitute PQ in place of S² in equations (5.6) and (5.7).

5.3.5. Sample size formulas when the error is expressed in relative terms (RE)

We can obtain formulas for estimates when the desired error is expressed in relative terms instead of absolute terms. For relative errors (RE), if (RE) is a proportion of the estimate, substitute (RE/k) for the coefficient of variation of the estimate in equation (4.7) or (4.8). We will denote by cv the estimated coefficient of variation. The true population coefficient of variation is denoted by CV. We then have:

$$n = \frac{N k^2 (CV)^2}{N (RE)^2 + k^2 (CV)^2} \tag{5.8}$$

Note: This applies to both means and totals.

If we ignore the fpc, then equation (5.8) becomes

$$n = \frac{k^2 (CV)^2}{(RE)^2} \tag{5.9}$$

NOTE 1: In actual practice, we usually do not know S² or (CV)². Indeed we do not even know s² in advance of the survey. Instead, we use rough estimates of S² or (CV)², obtained by the methods discussed in section 5.4.

NOTE 2: For the mean and the total, it is better to express the variance in relative rather than absolute terms, for two reasons:

(1) Most importantly, because a population's relative variance is more stable than its absolute variance. A guess or estimate of the population coefficient of variation CV (from past data or from similar populations) is likely to be closer to the true value than a guess or estimate of the variance.

(2) The formula for n is the same for estimators of means or totals when it is expressed in terms of the coefficient of variation.

NOTE 3: To estimate the proportion P, it is preferable to use the absolute error previously discussed because the proportion is itself a relative quantity, so that taking the percentage of a percentage can become confusing.

To obtain the formula for the sample size required to estimate a population proportion when the error is expressed as relative error (RE), use equation (5.8) with (CV)² replaced by Q/P. That is, we get

$$n = \frac{N k^2 Q}{N (RE)^2 P + k^2 Q} \tag{5.10}$$

If we ignore the fpc, equation (5.10) becomes:

$$n = \frac{k^2 Q}{(RE)^2 P} \tag{5.11}$$

Example 5.5

We would like to carry out a survey to estimate the total area in hectares of the farms in a population.
The estimate should be within 10% of the true value. How many farms should be surveyed? (In a pilot survey, we estimated the population coefficient of variation, CV, of the variable farm size to be 1.2.) Use 95% confidence.

Since the population size is not given, the fpc is ignored and equation (5.9) gives

$$n = \frac{k^2 (CV)^2}{(RE)^2} = \frac{(4)(1.2)^2}{(0.10)^2} = 576$$

5.3.6. Sample size formulas when the error is expressed in terms of the coefficient of variation

Equation (5.8) can be expressed in terms of the coefficient of variation. If CV is the population coefficient of variation and v0 is a specified target value for an estimate's coefficient of variation, then setting v0 = RE/k in (5.8) gives

$$n = \frac{N (CV)^2}{N v_0^2 + (CV)^2} \tag{5.12}$$

If we ignore the fpc, equation (5.9) gives:

$$n = \frac{(CV)^2}{v_0^2} \tag{5.13}$$

Equations (5.12) and (5.13) apply to both means and totals. Let's consider Example 5.5 and use coefficients of variation to solve the problem.

Example 5.6

Suppose that a survey is to be carried out to estimate the total area in hectares of the farms in a population. The estimate should be within 10 percent of the true value, with 95 percent confidence. How many farms should be surveyed? [In a pilot survey, we estimated the population coefficient of variation CV of the variable "farm size" to be 1.2.]

In this case,

k = 2
CV = 1.2
RE = 0.10, so that v0 = RE/k = 0.05.

Substituting in equation (5.13), we have

$$n = \frac{(1.2)^2}{(0.05)^2} = 576$$

Example 5.7

The results from a pilot test are used to estimate the mean and S for the variable "income" in a population of 5,000 households:

sample mean = $14,852 per household
s = $12,300

A full scale survey is planned. What should be the sample size for this survey if we want to estimate the mean income per household with a cv no larger than 5%?

The population coefficient of variation (CV) is estimated by:

$$cv = \frac{s}{\bar{x}} = \frac{12{,}300}{14{,}852} = 0.828$$

Substituting in equation (5.12) with N = 5,000 and v0 = 0.05:

$$n = \frac{(5{,}000)(0.828)^2}{(5{,}000)(0.05)^2 + (0.828)^2} = \frac{3{,}429}{13.19} = 260.1$$

so a sample of 261 households is required.

5.4. Advance Estimates of Population Variances

In the preceding section, we noted that most of the sample size formulas are written in terms of the population variance. In practice this is unknown and it must be estimated or guessed. There are five ways of estimating population variances for sample size determination.
Method 1: Select the sample in two steps, the first being a simple random sample of size n1 (the first sample) from which estimates s1² and p1 of S² and P, respectively, are obtained. Then use this information to determine the required n (the final sample size).

Method 2: Use the results of a pilot survey. This is one of the more commonly used methods.

Method 3: Use the results of previous samples of the same or similar population.

Method 4: Guess about the structure of the population and use some mathematical results.

Method 5: (Only for qualitative characteristics.) If the statistic to be measured is a proportion, then make a fairly good guess of P (the proportion in the population).

Method 1 carries out the survey in two steps. In the first step, only a subsample (a random part of the total sample) is enumerated. An analysis of this part permits one to estimate the variance and to make revisions in the total size of the sample, if necessary. In the second step, the remainder of the sample is enumerated in accordance with these changes, if any. This method gives the most reliable estimates of S² or P, but it is not often used, since it slows down the completion of the survey.

Method 2 is one of the more commonly used methods. It serves many purposes, especially if the feasibility of the main survey is in doubt. If the pilot survey is itself a simple random sample, the preceding methods apply. But often the pilot work is restricted to a part of the population that is convenient to handle or that will reveal the magnitude of certain problems.

Method 3 is also a very commonly used method. This method points to the value of making available, or at least keeping accessible, any data on standard errors obtained in previous surveys. Unfortunately, the cost of computing standard errors in complex surveys is high, and frequently only those standard errors needed to give a rough idea of the precision of the principal estimates are computed and recorded.
If suitable past data are found, the value of S² may require adjustment for time changes. Experience indicates that the variance of an item tends to change much more slowly over time than the mean value of the item itself. Even if the mean value changes, the relative error may be quite stable.

Method 4 uses some mathematical results. Deming (1960) showed that some simple mathematical distributions may be used to estimate S² from a knowledge of the range (h) and a general idea of the shape of the distribution of the characteristic of interest:

S² = 0.0289 * h² for a normal distribution
S² = 0.083 * h² for a rectangular (uniform) distribution
S² = 0.056 * h² for a distribution shaped like a right triangle
S² = 0.042 * h² for an isosceles triangle.

General Shape of Distribution              Approximate Value of S
Normal                                     .17 * h
Isosceles Triangle                         .20 * h
Right Triangle (Skewed Distribution)       .24 * h
Uniform Distribution                       .29 * h

These relations do not help much if the range, h, is large or poorly known. However, if h is large, good sampling practice is to stratify the population so that within any stratum the range is significantly reduced. Usually the shape also becomes simpler (closer to rectangular) within a stratum. Consequently, these relations are effective in predicting S² from h within individual strata.

Example 5.8

The universities in the State of Maryland were classified according to the number of enrolled students into four size classes. The standard deviation within each class is shown below:

Size Class (i)    Enrollment Level, Xi    Si
1                 < 1,000                 236
2                 1,000-3,000             625
3                 3,000-10,000            2,008
4                 > 10,000                10,023

If you knew the class boundaries but not the values of Si, how well could you guess the values by using the Deming method? (No university has fewer than 200 enrolled students and the largest has about 50,000.)
We do not know the number of universities in each size class; therefore, we cannot obtain a frequency distribution that would show us the general shape of the distribution. A conservative estimate would be that the distribution is uniform. In this case, Si would be given by 0.29 * hi, where hi is the range of each class.

S1 = 0.29 (1,000 - 200) = 232
S2 = 0.29 (3,000 - 1,000) = 580
S3 = 0.29 (10,000 - 3,000) = 2,030
S4 = 0.29 (50,000 - 10,000) = 11,600

Method 5: if the statistic to be measured is a proportion (for example, the proportion of farms growing corn), the population variance is approximately PQ. It is only necessary to be able to make a fairly good guess at P in order to estimate S². As long as the guess is reasonably close, we will get a good estimate of S². For example, suppose the true value of P is 0.4; then the value of S² = PQ would be 0.4 x 0.6 = 0.24. Suppose we made a rather poor guess of P, say 0.3. We would then estimate the value of the variance as 0.3 x 0.7 = 0.21, which differs from the true value by only about 12.5 percent.

Note that we can also estimate S² by setting S² = PQ = (1/2)(1/2), because the formula for n is maximized when P = Q = 1/2. This latter is called a "conservative estimate," because we can never do worse than that.

5.5. Cost and Operational Constraints

Let us recall that the total error is composed of both bias and variance. Large sample sizes reduce the variance (i.e., yield high precision) but tend to increase cost and operational difficulties, which translates into larger nonsampling errors.

To reduce the incidence of nonsampling errors, a survey needs:

(1) good quality control
(2) sufficient resources.

However, in a real survey setting, there exist constraints with respect to:

(a) budget
(b) field conditions
(c) field and office personnel
(d) time
(e) equipment and materials, etc.

Hence, in addition to precision, we also need to consider the maximum sample size that can be handled by the available resources.
It may be necessary to limit the sample size in order to stay within budget and operational constraints. If the maximum practical sample size is much smaller than that required to achieve the specified precision, calculations can be made to estimate the level of precision that could be expected from the actual sample size. If this level is not acceptable, greater resources have to be allocated to accommodate a larger sample size.

To compromise between precision and practicality, we may take a sample size that is somewhere between the constraint-based and the precision-based sizes.

5.6. Expected Sample Loss Due to Nonresponse

If past experience indicates that a certain level of nonresponse can be expected, we may want to inflate the calculated sample size to compensate. This is because our calculations were based on a 100 percent response. If we do not obtain all the interviews, then the estimates will be based on a number smaller than the calculated n and will, therefore, have a greater variance than expected.

Inflating Procedure

We compute the inflated sample size n' from the following relationship:

$$n' = \frac{n}{r}$$

where r is an estimate of the expected response rate; it can be obtained from previous rounds of the same survey, previous experience with similar surveys, a pilot (pre-test), etc.

For example, we calculate n to be 1,000 units. Based on the results of a pilot survey, we anticipate the response rate to be 70 percent. Our inflated n will be:

n' = 1,000/0.70 = 1,429

If our assumption was correct, we should get back 70% of 1,429 = 1,000. Therefore, our estimates will be based on the same number of units as expected and the target precision will be attained.

Important Note

Inflating the sample size when there is nonresponse only helps compensate for the resulting loss in precision. It does nothing to diminish the resulting nonresponse bias.

5.7. Number of Different Characteristics Requiring a Specified Precision

In most surveys information is collected from a sampling unit for more than one characteristic. One method of determining sample size is to specify margins of error for the characteristics that are regarded as most vital to the survey. An estimation of the sample size needed is first made separately for each of these important characteristics.

When the estimations of n have been completed for each of the most important characteristics, it is time to take stock of the situation. It may happen that the n's required are all reasonably close. If the largest of the n's falls within the limits of the budget, this sample size is selected. More commonly, there is sufficient variation among the n's so that we are reluctant to choose the largest, either for budgetary considerations or because this will give an overall level of precision substantially higher than originally contemplated for the other characteristics. In this event the desired level of precision may be relaxed for some of the characteristics in order to permit the use of a smaller value of n.

In some cases the n's required for different characteristics are so different that some of the characteristics must be dropped from the survey; with the resources available the precision expected for these characteristics is totally inadequate. The difficulty may not be merely one of sample size. Some characteristics call for a different type of sampling scheme than others. With populations that are sampled repeatedly, it is useful to gather information about those characteristics that can be combined economically in a general survey and those that need special methods. As an example, a classification of characteristics into four types, suggested by experience in regional agricultural surveys, is shown in Table 5.1.
In this classification, a general survey means one in which the units are fairly evenly distributed over some region as, for example, by a simple random sample.

Table 5.1. AN EXAMPLE OF DIFFERENT TYPES OF ITEMS IN REGIONAL SURVEYS

Type 1
    Characteristics of item: Widespread throughout the region, occurring with reasonable frequency in all parts.
    Type of sampling needed: A general survey with low sampling fraction.

Type 2
    Characteristics of item: Widespread throughout the region but with low frequency.
    Type of sampling needed: A general survey, but with a higher sampling fraction.

Type 3
    Characteristics of item: Occurring with reasonable frequency in most parts of the region, but with more sporadic distribution, being absent in some parts and highly concentrated in others.
    Type of sampling needed: For best results, a stratified sample with different intensities in different parts of the region (Chapter 5). Can sometimes be included in a general survey with supplementary sampling.

Type 4
    Characteristics of item: Distribution very sporadic or concentrated in a small part of the region.
    Type of sampling needed: Not suitable for a general survey. Requires a sample geared to its distribution.

Example

The following coefficients of variation per unit were obtained in a farm survey in Iowa, the unit being an area 1 square mile.

    Item                               Estimated cv
    Acres in farms (Y1)                0.38
    Acres in corn (Y2)                 0.39
    Acres in oats (Y3)                 0.44
    Number of family workers (Y4)      1.00
    Number of hired workers (Y5)       1.10
    Number of unemployed (Y6)          3.17

A survey is planned to estimate acreage characteristics with a cv of 2.5% and numbers of workers (excluding unemployed) with a cv of 5%. With simple random sampling, how many units are needed? How well would this sample be expected to estimate the number of unemployed?

The results are displayed in the following table:

    Item   Estimated cv   Target cv for estimate   n     Expected cv with n = 484
    Y1     0.38           0.025                    232   0.017
    Y2     0.39           0.025                    244   0.018
    Y3     0.44           0.025                    310   0.020
    Y4     1.00           0.050                    400   0.046
    Y5     1.10           0.050                    484   0.050
    Y6     3.17           --                       --    0.144

Comments

1. Assuming cost and workload constraints permitted it, a sample of 484 segments should be taken (the largest calculated size). This sample size should guarantee the desired precision (or better) for the estimates of Y1 through Y5. As noted in the last column, the cv of the estimate is expected to be either as small as desired or smaller, if n = 484 is used.

2. As for the estimate of Y6, a cv of approximately 14% can be expected if a sample size of 484 is used. Although it is true that the precision will be lower for this estimate than for the others, this is not critical because sponsors and data users did not require higher precision.

5.8. Population Subdivisions Requiring Separate Estimates of a Specified Precision

If there are subpopulations or domains of estimation for which separate estimates of a given precision are required, we must resort to a different sampling strategy, such as the use of stratified sampling with different sampling rates by stratum. Under stratified sampling, each stratum or domain is considered a "population" in its own right. We can then apply the same principles to calculate separate sample sizes within each stratum to meet the precision requirements for the domain estimates. Often the same precision is required in each domain. If the variability and the cost within the domain are similar from domain to domain, then the sample sizes will be about the same in all domains.

The overall sample size would then be the sum of the stratum sample sizes. The overall estimate for the whole population would have a higher precision than the stratum-level estimates. For example, if the unemployment rate is to be measured at the national level with x% target cv, the national sample size computed would be n, say 5,000 households. On the other hand, if the unemployment rate is needed for each of 5 regions of the country, all with the same precision, the total (national) sample size required would be around 5n or 25,000 households.
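The arithmetic of this example can be sketched in a few lines of Python. The sketch assumes, purely for illustration, that the five regions are of equal size with similar rates and variability, that the regional samples are independent, and that the finite population correction is negligible; under those assumptions the national estimate is the simple average of the regional estimates. The function name and the 10 percent regional cv are hypothetical, not values from this manual.

```python
import math

def domain_design(n_per_domain, n_domains, cv_domain):
    """Total sample size and approximate national cv.

    Assumes equal-sized domains with similar variability, independent
    domain samples, and a negligible finite population correction.
    """
    total_n = n_per_domain * n_domains
    # Averaging n_domains independent estimates divides the variance by
    # n_domains, so the cv shrinks by the square root of n_domains.
    cv_national = cv_domain / math.sqrt(n_domains)
    return total_n, cv_national

# 5 regions, 5,000 households each, hypothetical regional cv of 10 percent
total_n, cv_national = domain_design(5000, 5, 0.10)
print(total_n)
print(round(cv_national, 4))
```

Under these assumptions the regional requirement, not the national one, drives the total sample size.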
The national estimate would have a precision much higher than originally planned.

5.9. Expected Gain or Loss in Efficiency

The formulas discussed so far are all based on simple random sampling (SRS). Let us denote by \(n_{srs}\) the sample sizes obtained from those formulas. However, as will be seen later on, simple random sampling is rarely used in complex surveys. The efficiency of the design actually used is measured by comparing the variance of an estimator obtained with the complex design and the variance of the same estimator with SRS.

- If the complex design is more efficient, that is, inherently tends to produce a lower variance than SRS, then our precision is likely to be better than expected with \(n_{srs}\).
- If, on the other hand, the complex design is less efficient than the SRS one, that is, has an inherent tendency to produce a higher variance for the estimator than SRS, then our expected precision level may not be met with the calculated \(n_{srs}\). In this case, it would be desirable to inflate \(n_{srs}\) beforehand.

As we study different sampling schemes, we will know which are more efficient than SRS and which are less. Here are some examples:

Usually more efficient than SRS:
- stratified sampling, implicit stratification in systematic selection, use of more efficient estimators (e.g., ratio estimators of total)

Generally less efficient than SRS:
- cluster sampling (used for convenience and cost effectiveness)

The efficiency of a particular sample design is measured by the design effect. XXX (see Chapter 4).

5.10. Relationship Between Size of Sample and Size of Population

We return to certain implications of the basic formula from which all the above formulas are derived.
That basic formula was given in equation (4.4) in Chapter 4 as:

\[ \sigma_{\bar{y}}^2 = \frac{N-n}{N} \cdot \frac{S^2}{n} \qquad (5.14) \]

Notice that the sampling variance of the mean is equal to the variance of individual observations (S²) in the population multiplied by the factor \(\frac{N-n}{Nn}\).

What happens when the sample increases from its smallest possible size (n = 1) to its largest possible size (n = N)? When n = 1,

\[ \sigma_{\bar{y}}^2 = \frac{N-1}{N} S^2 \]

This states the familiar fact that the variance of the means of samples of one unit is the same as the variance of individual observations in the population. At the other extreme, when n = N,

\[ \sigma_{\bar{y}}^2 = \frac{N-N}{N} \cdot \frac{S^2}{N} = 0 \]

That is, if the sample includes the entire population, the mean is estimated without sampling error.

For sample sizes between these extremes, how does the sampling fraction (sampling rate) n/N affect the standard error? The answer, sometimes surprising to students, is that for populations that are large relative to the sample size, the absolute size of the sample (n) and not the sampling fraction n/N determines the precision of the estimated mean. This follows from the fact that when N is large relative to n, the factor (N − n)/N ≈ 1. (The symbol ≈ stands for "is approximately equal to".) Then

\[ \sigma_{\bar{y}}^2 \approx \frac{S^2}{n} \]

Thus, it is clear that the error depends on S² and n, and not on the sampling fraction n/N.

On the other hand, for small populations the sampling fraction does have an effect. For example, suppose two populations have the same mean and the same variance, S² = 100, while N1 = 40 and N2 = 400. If we take the same size of sample from each, say n1 = n2 = 20, the standard errors are related (in an inverse way) to the sampling fractions. Equation (5.14) then gives:

                      N      n     n/N    Standard error of the mean
    1st population    40     20    .50    1.6
    2nd population    400    20    .05    2.2

The number of sample units needed to achieve the same precision would be greater for the second (larger) population. However, the number of sample units needed to achieve a given reliability does not increase indefinitely as the number of elements in the population increases.
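As a quick check of the two-population comparison above, the sketch below evaluates equation (5.14) directly. The function name `se_mean` is an illustrative choice, and the two larger population sizes are hypothetical additions that show the standard error levelling off near the square root of S²/n as N grows.

```python
import math

def se_mean(S2, n, N):
    """Standard error of the mean under simple random sampling
    without replacement: sqrt((N - n) / N * S2 / n)."""
    return math.sqrt((N - n) / N * S2 / n)

# S2 = 100 and n = 20 as in the text; 4,000 and 4,000,000 are
# hypothetical extra population sizes.
for N in (40, 400, 4_000, 4_000_000):
    print(N, round(se_mean(100, 20, N), 2))

# Limiting value when the sampling fraction is negligible
print(round(math.sqrt(100 / 20), 2))
```

Beyond a few thousand elements the population size has almost no effect; only S² and n matter.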
In other words, we reach a point in which adding an extra sampling unit does not produce a sizable reduction in variance.

Example

Table 5.2 below shows the size of sample necessary to give an estimate of the population mean within a 5 percent error (E = 0.05) of the estimate (with confidence coefficient k = 2) for populations ranging in size from 50 to 10,000,000 elements and with (CV)² = .10 in each case. These results were obtained using equation (4.8) of Chapter 4. Equation 4.8 is given by:

\[ n = \frac{k^2 (CV)^2}{E^2 + \dfrac{k^2 (CV)^2}{N}} \]

Table 5.2 NUMBER OF ELEMENTS NECESSARY FOR FIXED PRECISION: (CV)² = .10 (E = .05 and k = 2)

    Number of elements in     Number of elements        n/N
    the population (N)        required in sample (n)
    50*                       38                        .76
    100                       62                        .62
    1,000                     138                       .14
    10,000                    158                       .016
    100,000                   160                       .0016
    1,000,000                 160                       .00016
    10,000,000                160                       .000016

    * Use equation (4.8) when N is smaller than 50.

As an example, let's calculate the first value of n in Table 5.2. Since N = 50 and is very small for a population value, we have to use the formula for n that contains the finite population correction factor (N-n)/N. The series of steps leading to the number 38 in Table 5.2 is shown below.

The objective is to leave n on one side of the equation in terms of the other components. Now, we know that (CV)² = 0.10. This is the population coefficient of variation and is given to us as a known value. However, we do not know the value of but we can obtain it by using the following:

Consequently, we have the following value for n when N = 50:

\[ n = \frac{(2)^2 (0.10)}{(0.05)^2 + \dfrac{(2)^2 (0.10)}{50}} = \frac{0.40}{0.0105} \approx 38 \]

Table 5.2 shows that for small populations, the sample size needed for a given accuracy does increase as the population increases, but the sample size approaches a fixed number as the population gets very large.
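The sample sizes in Table 5.2 can be approximated with the short sketch below. It uses the common form n = n0 / (1 + n0/N), with n0 = k²(CV)²/E², which is assumed here to be algebraically equivalent to equation (4.8); the function name is illustrative. Results are rounded to the nearest integer, so an entry may differ from the published table by one unit where a different rounding rule was applied.

```python
def sample_size(N, cv2=0.10, E=0.05, k=2):
    """Sample size for relative error E with confidence coefficient k.

    Assumed form: n0 / (1 + n0 / N), where n0 = k^2 * CV^2 / E^2 is the
    size needed for a very large population.
    """
    n0 = k**2 * cv2 / E**2
    return n0 / (1 + n0 / N)

populations = [50, 100, 1_000, 10_000, 100_000, 1_000_000, 10_000_000]
for N in populations:
    n = sample_size(N)
    print(f"{N:>12,}  n = {round(n):>3}  n/N = {n / N:.6f}")
```

The required n levels off at n0 = 160 once N is large, which is the pattern the table is meant to show.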
The largest size of sample we would ever need for this accuracy (with (CV)² = .10) is 160 elements, and this is approximately the number we would need whether there are 10,000 or 10,000,000 elements in the population. Furthermore, if we had used a sample of 160 for a population even as small as 1,000, the sample would be somewhat larger than necessary; but the excess would not have been very serious.

Chapter 5 Simple Random Sampling

Problems

1. State park officials were interested in the proportion of campers who consider the campsite spacing adequate in a particular campground. They decided to take a simple random sample of size n = 30 from the first N = 300 camping parties which visit the campground. Let yi = 0 if the head of the i-th party sampled does not think the spacing is adequate and yi = 1 if he does (i = 1, 2, . . . , 30). Use the data below to estimate P, the proportion of campers who consider the campsite spacing adequate. Find the sampling error of the estimate and its coefficient of variation.

    Camper sampled    Response yi
    1                 1
    2                 0
    3                 1
    . . .             . . .
    29                1
    30                1

Answers: p = 0.8333; samp. error(proportion) = 0.065653216; cv(proportion) = 7.88%

2. Use the data in Exercise 1 to determine the sample size required to estimate P with a bound on the error of estimation of magnitude E = 0.05.

Answer: n = 125

3. A simple random sample of 100 water meters within a community is monitored to estimate the average daily water consumption per household over a specified dry spell. The sample mean and sample variance are found to be \(\bar{y} = 12.5\) and s² = 1252, respectively. If we assume that there are N = 10,000 households within the community, estimate μ, the true average daily consumption, and find the sampling error of the mean and its coefficient of variation.

Answers: Mean = 12.5; samp. error (mean) = 3.538361; cv(mean) = 28.31%

4. Using Exercise 3, estimate the total number of gallons of water, T, used daily during the dry spell.
Find the sampling error of the total and its coefficient of variation.

Answers: Total = 125,000; se(total) = 35,383.61; cv(total) = 28.31%

5. Resource managers of forest game lands are concerned about the size of the deer and rabbit populations during the winter months in a particular forest. As an estimate of population size, they propose using the average number of pellet groups for rabbits and deer per 30-foot-square plot. Using an aerial photograph, the forest was divided into N = 10,000 thirty-foot-square grids. A simple random sample of n = 500 plots was taken, and the number of pellet groups was observed for rabbits and for deer. The results of this study are summarized below:

                       Deer     Rabbits
    Sample mean        2.30     4.52
    Sample variance    0.65     0.97

Estimate μ1 and μ2, the average number of pellet groups for deer and rabbits respectively, per 30-foot-square plot. Find the sampling error and the coefficient of variation of each mean.

Answers: Mean(deer) = 2.30; se(deer) = 0.035142567; cv(deer) = 1.53%
Mean(rabbits) = 4.52; se(rabbits) = 0.042930176; cv(rabbits) = 0.95%

6. A simple random sample of n = 40 college students was interviewed to determine the proportion of students in favor of converting from the semester to the quarter system. If 25 of the students answered affirmatively, estimate the proportion of students on campus in favor of the change. (Assume N = 2000.) Find the sampling error of the proportion and its coefficient of variation.

Answers: p = 0.625; se(proportion) = 0.076743; cv(proportion) = 12.28%

7. A dentist was interested in the effectiveness of a new toothpaste. A group of N = 1,000 school children participated in a study. Prestudy records showed there was an average of 2.2 cavities every six months for the group. After three months on the study, the dentist sampled n = 10 children to determine how they were progressing on the new toothpaste.
Using the data below, estimate the mean number of cavities for the entire group and find the sampling error and the coefficient of variation of the mean.

    Child    Number of cavities in the three-month period
    1        0
    2        4
    3        2
    4        3
    5        2
    6        0
    7        3
    8        4
    9        1
    10       1

Answers: Mean = 2.0; se(mean) = 0.469039; cv(mean) = 23.45%

8. The Fish and Game department of a particular state was concerned about the direction of its future hunting programs. In order to provide for a greater potential for future hunting, the department wanted to determine the proportion of hunters seeking any type of game bird. A simple random sample of n = 1000 of the N = 99,000 licensed hunters was obtained. If 430 indicated they hunted game birds, estimate P, the proportion of licensed hunters seeking game birds. Find the sampling error and the coefficient of variation of the proportion.

Answers: p = 0.43; se(proportion) = 0.015584194; cv(proportion) = 3.62%

9. Using the data in Exercise 8, determine the sample size the department must obtain to estimate the proportion of game-bird hunters, given an error of estimation E = 0.02.

Answer: n = 2,300

10. A company auditor was interested in estimating the total number of travel vouchers that were incorrectly filed. In a simple random sample of n = 50 vouchers taken from a group of N = 250, 20 were filed incorrectly. Estimate the total number of vouchers from the N = 250 that have been filed incorrectly, and find its sampling error and coefficient of variation. (Hint: If P is the population proportion of incorrect vouchers, then NP is the total number of incorrect vouchers. An estimator of NP is Np, which has an estimated variance given by \(N^2 \cdot \frac{pq}{n-1} \cdot \frac{N-n}{N}\), where q = 1 − p.)

Answers: p = 0.4; se(proportion) = 0.062596864; cv(proportion) = 15.65%

11. A psychologist wishes to estimate the average reaction time to a stimulus among 200 patients in a hospital specializing in nervous disorders.
A simple random sample of n = 20 patients was selected and their reaction times were measured with the following results: \(\bar{y} = 2.1\) seconds and s = 0.4 seconds. Estimate the population mean, μ, and find the sampling error and the coefficient of variation of the mean.

Answers: Mean = 2.1; se(mean) = 0.084852814; cv(mean) = 4.04%

12. In Exercise 11, how large a sample should be taken in order to estimate μ with an error of estimation equal to one second? Use 1.0 second as an approximation of the population standard deviation.

Answer: n = 4

13. The manager of a machine shop wishes to estimate the average time that it takes for an operator to complete a simple task. The shop has 98 operators. Eight operators are selected at random and timed. The following are the observed results:

    Time in minutes
    4.2    5.1    7.9    3.8
    5.3    4.6    5.1    4.1

Estimate the average time it takes an operator to complete a simple task and find the sampling error and the coefficient of variation of the average time.

Answers: Mean = 5.0125; se(mean) = 0.43556930; cv(mean) = 8.69%

14. A sociological study conducted in a small town calls for the estimation of the proportion of households which contain at least one member over 65 years of age. The town has 621 households according to the most recent city directory. A simple random sample of n = 60 households was selected from the directory. At the completion of the field work, out of the 60 households sampled, 11 contained at least one member over 65 years of age. Estimate the true population proportion, P, and find the sampling error and the coefficient of variation of the proportion.

Answers: p = 0.1833; se(proportion) = 0.047876471; cv(proportion) = 26.12%

15. In Exercise 14, how large a sample should be taken in order to estimate P with an error of estimation of 0.08? Assume the true proportion P is approximately 0.2.

Answer: n = 84

16. An investigator is interested in estimating the total number of "count trees" (trees larger than a specified size) on a plantation of N = 1500 acres.
This information is used to determine the total volume of lumber for trees on the plantation. A simple random sample of n = 100 one-acre plots was selected, and each plot was examined for the number of count trees. If the sample average for the n = 100 one-acre plots was \(\bar{y} = 25.2\) with a sample variance of s² = 136, estimate the total number of count trees on the plantation and find the sampling error and the coefficient of variation of the estimated total.

Answers: Total = 37,800; se(total) = 1689.97; cv(total) = 4.47%

17. Using the results of the survey conducted in Exercise 16, determine the sample size required to estimate T, the total number of trees on the plantation, with an error of estimation E = 1500.

Answer: n = 388

18. You want to design a household survey to estimate average annual income per household. The number of households is 2,000,000. On the basis of the data from a previous census, the population variance of annual income per household is estimated to be 1,000,000 (that is, S = 1000).

a. What sample size is necessary to estimate the average annual income with a 95 percent confidence that the result is accurate to plus or minus $100?

Answer: n = 385

b. What size sample is necessary to estimate average annual income within plus or minus $50, also at the 95 percent confidence level?

Answer: n = 1,537

19. Refer to the universe of eight farms listed below with known value of land and buildings as follows:

    Farm 1 - $2026
    Farm 2 - $6854
    Farm 3 - $1532
    Farm 4 - $2180
    Farm 5 - $5408
    Farm 6 - $9284
    Farm 7 - $1438
    Farm 8 - $8836

a. List the 28 simple random samples of two farms each, compute the mean for each sample and verify that the average of all 28 means is $4,694.75.

b. Compute the standard deviation of the 28 means and check that the standard deviation is $2,037 (or 2,036.776).

Chapter 6. PRACTICAL CONSIDERATIONS IN SELECTING A SAMPLE
________________________________________________________________________________

6.1 SAMPLING FRAME

In order to select a sample, it is necessary to have a sampling frame; that is, a list of all elements (or the equivalent, such as a list of blocks, housing units, etc.) so that the probability of selection of each element can be known in advance. The frame need not be literally a list. In sampling from cards, questionnaires, etc., the documents themselves can be considered as the frame. But it is necessary to know that the file is complete.

For example, in sampling from a file of records, one should make sure that no records are out of the file--in use or waiting to be refiled--since such records would not have any chance of selection. Again, in using a population register maintained by local authorities, one should make certain the list is current. For example, the list might not contain all newly formed families, such as those of recently married couples. Since new families and those that move around are likely to differ in their characteristics from older and more settled families, a biased sample would result.

In using local registers or lists, it may be useful to conduct an actual check of the completeness, on a more or less informal basis. This can be done by going out to the area to be sampled, selecting a few families (or farms or business firms) scattered around the area, and checking to see if they are on the list. If possible, it is better to select families of the type likely to be missing from the list, since this would provide a better test. A rough idea of the adequacy of the list can be obtained in this manner.

6.2 PROBABILITY OF SELECTION OF UNITS

Special difficulties arise when some units have more than one chance of selection--for example, when sampling from a file in which some individuals are included more than once; when selecting a sample of families from a sample of individual persons; etc.
To illustrate, one might select a sample of school children and use it to select families. It is clear that if one draws a sample of families by first selecting a sample of persons and including the families to which these persons belong, the families will have unequal probabilities of selection, since the larger the family the greater the chance of selection. Similarly, selecting a sample of a business firm's customers by using a record file containing a separate sheet (or card) for each purchase will give customers making more than one purchase a greater chance of selection.

To avoid the biases which result from giving some of the units a greater chance of selection than others, it is desirable to restrict the sampling procedure so that each unit has only one chance of selection. For example, when selecting a sample of families, we could make a rule to include the family only if the head of the family is the person selected. Since each family has only one head, each family would have the same chance of selection. The specified person on whom the selection of the family depends need not be the head; he/she could just as well be the oldest person, the youngest child, etc. The only requirement is that each family have one and only one such member. Similarly, in sampling customers, we could restrict the sample by using only the cards with the earliest date for each customer, etc.

While the technique described in the preceding paragraph is generally recommended, whether the sample is drawn from a file, a set of questionnaires, or is selected in the field, there are other techniques that might be used. They will provide unbiased estimates of the universe, although they do not strictly satisfy the conditions of simple random sampling. Some of these techniques are:

(1) After selecting the initial sample by including all families for which one (or more) person has been selected, we group the sample by size of family.
It is clear that families with 2 members have twice the chance of selection as those with 1; families with 3 members have three times the chance of selection; etc. Therefore, instead of interviewing all families in the sample, we interview only ½ of the two-member families; 1/3 of the three-member families; etc.

(2) Proceed as above, but interview all families instead of ½, 1/3, etc. However, in tabulating the results, tabulate each size class separately, and multiply the results of the two-person families by ½, the three-person families by 1/3, etc., before adding the results together.

6.3 FRAMES INCLUDING OUT-OF-SCOPE UNITS

Sometimes the only available frame is a list which includes some units which are outside the scope of the universe defined for the survey. For example, suppose a special analysis is desired of the census characteristics of males. The only source for sampling is a card file containing cards for all persons both male and female, and it is not feasible to remove all the cards for females. The file can still be used as a frame even though cards for both males and females will be designated by the random selection process. The proper procedure in such a case is to take only the cards for the males selected, and disregard those for the females. Do not substitute.

A procedure that is sometimes erroneously used (and may cause serious bias) is to substitute the next "male" card in the file for each "female" card drawn in the sample. There are two things wrong with this method:

(1) It results in a higher sampling rate than that specified. Also, the sampling rate actually obtained cannot be calculated unless the total number of males is known. This makes it impossible to use the reciprocal of the sampling rate, N/n, as a multiplier to produce estimates of totals from the sample.

(2) A more serious objection to this substitution lies in the biases it may introduce in the selection process.
Suppose we have a list of all housing units and we wish to select a sample of occupied dwellings only. If we use a procedure that substitutes the next occupied unit for each vacant housing unit that falls into the sample, occupied units that are neighbors of vacant ones will have two chances of selection--the chance that their own listing entry is selected and the chance that the listing of the neighboring vacant dwelling is selected. If vacant units are more likely to be found in poor and undesirable neighborhoods, this would mean that occupied housing units in such areas would be over-represented in the sample.

6.4 SYSTEMATIC SAMPLING

The work necessary to draw a simple random sample can be quite burdensome when the number of units to be selected is large. For example, to get a 5 percent sample of 20,000 elements, it would be necessary to select 1,000 random numbers from a table of random numbers and then to select the designated units from the population. In practice, most statisticians prefer a different method. A sample of this size is usually drawn by taking a random number between 1 and 20, then taking every 20th element thereafter. Thus, if the random number is 3, the elements taken will be 3, 23, 43, 63, and so on up to 19,983. The reciprocal of the sampling rate (20 in this case) is called the sampling interval. The method of estimating the mean, total, or a proportion is the same as for simple random sampling.

This type of sampling is called systematic sampling. It is not the same as simple random sampling, but it is an acceptable sampling method because the chance of selecting any one element is known and we can calculate the sampling errors. If the elements in the population are arranged in a nearly random order (that is, with very little correlation between successive elements), the results of systematic sampling will be in close agreement with those of simple random sampling.
Experience shows that, generally, the two methods will give results of roughly the same accuracy. The systematic sample will often have a somewhat smaller sampling error, since it will make certain the sample will be spread throughout the population. We may make use of the formulas for simple random sampling to evaluate the reliability of estimates from a systematic sample; the result will usually somewhat overstate the standard error for systematic sampling. In other words, we will underestimate, slightly, the reliability of the estimates. There are ways of calculating the standard errors of systematic samples more precisely; however, they are not covered in these chapters.

6.4.1 General Procedure for Selecting a Sample

The systematic sample selection procedure consists of the following steps:

1. Assign serial numbers from 1 through N to the population units.

2. Calculate SI = N/n, the sampling interval.
   - For exactness, carry as many decimals as possible.
   - You may round if you are doing this without a calculator, but you would be sacrificing exactness for convenience.

3. Select a random number (RN) from a table of random numbers between 0 and the SI. This is called a random start (RS).
   - In the permitted range, exclude zero, but include the sampling interval.
   - Use as many digits as SI has, including decimals.
   - If you are searching through a RN table, pretend the decimal point is not there.
   - If you are using a calculator which only provides random numbers between zero and one, multiply this random number by the value of SI in order to get a random number between zero and SI. Remember to keep the decimals; do not round yet.

4. Begin the series of cumulated numbers with RS. Add SI to this first number to determine the second. Then, add SI to the second number to get the third, and so on.
   - Do not round decimals during the addition process.

5. Stop cumulating when the last cumulated number exceeds N (discard this last number).
   - This should occur when you have cumulated n numbers.
   - If you rounded SI before adding, you may not have exactly n.

6. Now go back and round all the cumulated numbers up to the next integer.

7. On the list of population units, circle the serial numbers that correspond to these integers.
   - These are the selected units.

Example 6.1

Suppose that a village contains 285 housing units (HUs) and we wish to select a systematic sample of 12 HUs for a survey. Assume the list is randomly ordered. We want to determine the HUs that will be in the sample.

1. SI = N/n = 285/12 = 23.75
2. RN between 0001 and 2375 is 1979
3. RS = 19.79
4. Series of cumulated numbers:

    Sample unit    Selection number of selected unit    Actual unit selected (serial number after rounding up)
    1              19.79                                20
    2              19.79 + 23.75 = 43.54                44
    3              43.54 + 23.75 = 67.29                68
    4              67.29 + 23.75 = 91.04                92
    5              114.79                               115
    6              138.54                               139
    7              162.29                               163
    8              186.04                               187
    9              209.79                               210
    10             233.54                               234
    11             257.29                               258
    12             281.04                               282
    13             304.79                               (Discard)

Remarks: Let's see what might have happened if we had not carried the decimals.

1. SI = N/n = 285/12 = 23.75 rounded up to 24.
2. Suppose RN between 01 and 24 is actually 24.
3. RS = 24
4. Results:

    (1)     24
    (2)     48
    (3)     72
    . . .
    (11)    264
    (12)    288 (discard)

We exhausted the population before reaching our 12 units. This would not have happened if we had kept the decimals (had not rounded up at the beginning), even if our RN was equal to the SI.

6.4.1.2 Useful Variation for Use with Computer Software Packages

We accomplish the same results by truncating instead of rounding up. Refer to Section 6.4.1 above.

- In step 3 of Section 6.4.1, while choosing RN, include zero but exclude SI;
- Add 1 to RN to define RS;
- Then, in step 6, truncate (that is, retain only the integer portion of the number), instead of rounding up.
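The truncation variant just described can be sketched in Python as follows. This is a minimal illustration, not a production routine: the function name and the optional `rn` argument (used to replay a fixed random number) are assumptions made here, and each selection number is computed as RS + k·SI by multiplication rather than by repeated addition, which avoids accumulating floating-point error.

```python
import random

def systematic_sample(N, n, rn=None):
    """Select n of N serially numbered units with a fractional interval.

    Truncation variant: RN is drawn in [0, SI), RS = RN + 1, and each
    selection number RS + k*SI is truncated to its integer part.
    """
    si = N / n                      # sampling interval, decimals kept
    if rn is None:
        rn = random.random() * si   # random number in [0, SI)
    rs = rn + 1                     # random start
    return [int(rs + k * si) for k in range(n)]

# Village of 285 housing units, sample of 12, RN = 19.79
print(systematic_sample(285, 12, rn=19.79))
# [20, 44, 68, 92, 115, 139, 163, 187, 210, 234, 258, 282]
```

Because RN is strictly less than SI, the last selection number stays below N + 1, so exactly n units are selected and none needs to be discarded.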
This alternative is convenient when using computer software packages because their rounding functions usually round to the nearest integer instead of always rounding up. So, it is better to use the integer functions, which always truncate.

Let's look at an example in order to clarify the concepts. Refer to the previous example.

1. SI = N/n = 285/12 = 23.75
2. RN between 0000 and 2374 is 1979.
3. RS = 19.79 + 1 = 20.79
4. Series of cumulated numbers:

    Sample unit    RS + k(SI)                 Actual unit selected (serial number after truncating)
    1              20.79                      20
    2              20.79 + 23.75 = 44.54      44
    3              44.54 + 23.75 = 68.29      68
    4              68.29 + 23.75 = 92.04      92
    5              115.79                     115
    6              139.54                     139
    7              163.29                     163
    8              187.04                     187
    9              210.79                     210
    10             234.54                     234
    11             258.29                     258
    12             282.04                     282
    13             305.79                     (Discard)

6.4.2 Caution in the use of systematic sampling

There is one situation in which systematic sampling will give very poor reliability. That is the case in which the arrangement of the elements in the population follows a very regular (periodic) pattern and the sampling interval of the systematic sample falls into that pattern. For example, suppose all families in a certain population consisted of exactly four persons--the head, his wife, and two children. The population has been listed in the order just given and we wish to draw a 25 percent systematic sample from this list to obtain some special information. Since the sampling procedure is to take every fourth person starting at random, four possible samples could be obtained:

(1) Random start is 1--the sample will consist entirely of heads of families.
(2) Random start is 2--the sample will consist entirely of wives of heads.
(3) Random start is 3 or 4--the sample will consist entirely of children.

In a case such as this, results from sample to sample would have nearly the maximum possible variation, and it would be likely that estimates based on any one of the samples would be quite far from the true values for the population.
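The periodic-list problem described above can be illustrated with a small sketch. The ages assigned to each position are hypothetical, chosen only to make the four possible samples differ; the list holds 100 identical families in the order head, wife, child, child. Each random start yields a sample made up of a single position in the family, so the four sample means differ widely, while their average over the four equally likely starts equals the population mean.

```python
from statistics import mean

# Hypothetical ages for the four positions in every family's listing.
pattern = [("head", 40), ("wife", 38), ("child", 12), ("child", 8)]
population = pattern * 100               # 100 families, same listing order

population_mean = mean(age for _, age in population)

sample_means = []
for start in range(1, 5):                # the four possible random starts
    sample = population[start - 1::4]    # every fourth person from the start
    roles = {role for role, _ in sample} # positions represented in the sample
    sample_means.append(mean(age for _, age in sample))
    print(start, roles, sample_means[-1])

# Average of the four possible sample means versus the population mean
print(mean(sample_means), population_mean)
```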
However, even in this extreme case, the estimates would be unbiased; that is, the averages of the estimates for all possible samples would be the population averages.

Although the example given above is not likely to occur in practice, approximations to this situation sometimes arise. If there is suspicion of any regularity in the sequence of listing, which could conform to the sampling interval, systematic sampling should be avoided or modified. For example, the list could be randomized before systematic selection is used.

6.4.3 Modified systematic sampling

One variant of systematic sampling that could be used when there is some systematic ordering in the population is to use a different random number within each sampling interval. To illustrate, let us use the previous example of a 25-percent sample when family members are listed in order--head, wife, and two children. With a systematic sample, once a random number is selected, this sets the pattern for the entire sample. As explained above, if the random number is 1, the sample will be the 1st, 5th, 9th, 13th person, etc. (all heads of families); if the random number is 2, the sample will include the 2nd, 6th, 10th, 14th person, etc. (all wives of heads). To avoid this difficulty, we can select a different random number within each group of 4 persons, so as to avoid a constant interval between our sample cases. The selection scheme is indicated below:

Random number (1 to 4)   Group of four persons   Person selected
3                        1st                     3rd
1                        2nd                     5th
2                        3rd                     10th
1                        4th                     13th
4                        5th                     20th
etc.

That is, in the first group, one child is selected because the random number is 3. In the second group, the head is selected because the random number is 1 and the head is the first person in the group, but the fifth person in the list. In the third group, the second person is selected (the wife), who is the 10th person in the list, and so forth.
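The selection scheme in the table above can be sketched as follows. This is an illustrative sketch only; the function names, the seed, and the list of ten hypothetical families are assumptions made for the example.

```python
import random


def select_within_groups(random_numbers, interval, N):
    """Turn one random number per group into serial numbers on the list.

    Group g (counting from 0) covers serials g*interval + 1 ... (g+1)*interval.
    A pick that falls beyond N (possible only in a final short group) is dropped,
    so every unit keeps the same 1/interval chance of selection.
    """
    serials = []
    for g, r in enumerate(random_numbers):
        serial = g * interval + r
        if serial <= N:
            serials.append(serial)
    return serials


def modified_systematic_sample(N, interval, seed=None):
    """Draw a fresh random number (1..interval) for every group of `interval` units."""
    rng = random.Random(seed)
    n_groups = -(-N // interval)  # ceiling division
    random_numbers = [rng.randint(1, interval) for _ in range(n_groups)]
    return select_within_groups(random_numbers, interval, N)


# The random numbers from the table: 3, 1, 2, 1, 4 for the first five groups
print(select_within_groups([3, 1, 2, 1, 4], 4, 20))

# Hypothetical list: 10 families listed as head, wife, child, child
roles = ["head", "wife", "child", "child"] * 10
ordinary = range(1, len(roles) + 1, 4)          # every 4th person, random start 1
print({roles[i - 1] for i in ordinary})         # only heads are selected
picked = modified_systematic_sample(len(roles), 4, seed=2006)
print([roles[i - 1] for i in picked])           # roles now vary from group to group
```

The first call returns the persons shown in the table (3rd, 5th, 10th, 13th, 20th). Each unit still has a one-in-four chance of selection, but the interval between selected units is no longer constant.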
The system requires more work than ordinary systematic sampling, but it avoids the possibility of the patterns indicated above.

We do not mean to imply that such patterns as described above usually exist and that systematic sampling should be avoided. In most cases, systematic sampling produces very satisfactory results.

6.4.4 Serial number as a sampling source

Frequently, in sampling office files, the records have a serial number. We may take advantage of this fact to draw the sample; for example, by designating all records whose serial numbers end in 5, 7, or some other number chosen from a table of random numbers. However, before deciding on this system, one should make sure that the last digit of the serial number is actually random, and does not represent a nonrandom arrangement of some kind; if it does, we might obtain only one particular type of unit in the sample by repeatedly selecting the same last digit. If such a serial number is not present, frequently one can be assigned at random with little cost, and used for sampling.

6.5 GUIDELINES ON WHEN TO USE DIFFERENT SAMPLING SCHEMES

6.5.1 When to Use Simple Random Sampling (SRS)

Some situations which suggest the use of SRS are:

1. There are no major cost differences associated with including various classes of sampling units in the sample.
2. The population is relatively homogeneous with respect to the major characteristics being estimated.
3. There is no auxiliary information available for the population units.
4. There are no cost savings in surveying units which are close together or other natural clusters of the population.
5. A sampling frame which lists each population element is available.
6. There is no need to make separate estimates for subdivisions of the population.

It should be noted that none of these reasons on its own is enough to justify the use of SRS.
6.5.2 When to Use Systematic Sampling

There are several reasons for using systematic sampling, but in practice, the main reason usually is:

- to select a SRS quickly (from a randomly ordered list)

This type of systematic sampling is suggested for SRS when:

1. The frame is a record system requiring a manual selection of sample units (e.g., a physical list, card files, etc.).
2. Sampling units are arranged in random order.
3. Time and resources for selecting the sample are limited.
4. No periodicity is suspected in the data.

Systematic sampling can also be used to provide implicit stratification during sample selection if sampling units are arranged in a particular order. This type of sampling, however, would not be SRS.

6.5.3 When to Use Stratification

Some situations which suggest the use of stratified sampling are:

1. Natural or predefined strata of the population exist: e.g., geographic divisions such as states, provinces; ecological zones that have great socioeconomic impact on the population, etc.
2. There exist subpopulations of interest for which separate estimates of a given precision are required.
3. For administrative convenience, such as regional offices of national statistical offices. Strata could be created so that each regional office can handle the sampling and the interviewing in its own area.
4. Stratification can provide a reduction in cost.
5. Stratification can provide a reduction in variance. This would occur if
   a. The variables of interest are correlated with the variable of stratification.
   b. The potential strata are internally homogeneous with respect to the variables of interest.
6. Auxiliary information upon which to base the stratification is available for all population units.
7. Different sampling strategies are required in different parts of the population.

6.5.4 When to Use Single-Stage Cluster Sampling

Some situations which suggest the use of cluster sampling are:

1. Natural or predefined clusters of the population exist: e.g., Metropolitan Statistical Areas (MSAs), Enumeration Districts (EDs), Enumeration Areas (EAs), etc.
2. Confining sampling operations to units that are nearby produces large cost and time savings.
3. No frame is available which lists all population elements, but one could be constructed for a limited number of clusters to list all elements in the cluster.
4. Elements within clusters are heterogeneous with respect to variables of interest.
5. Cluster means of the variables of interest are similar among themselves.
6. Cost savings justify the relative loss in precision.
7. Nonsampling errors can be controlled more effectively (e.g., the listing operation can be done more accurately for a cluster than for the whole population, yielding better coverage).

It is generally recommended that clusters be selected either with probability proportional to size or with equal probabilities after stratification by size. In addition, it is recommended that larger clusters be placed in certainty strata so they may all be included in the sample. This is done in order to control the variance of estimates.

6.5.5 When to Use Multi-Stage Sampling

The situations which suggest the use of a multistage design are the same as for single-stage cluster sampling, except that multistage sampling is preferred over single-stage sampling when:

1. It is operationally impractical to survey all elements in a cluster, or
2. Only a limited number of sample elements can be handled, and concentrating them in a few clusters would result in estimates of poor precision. In such a case, it would be more efficient to spread the sample over more clusters and only subsample each cluster.

Remarks: The above guidelines for using different sampling schemes are not meant to be rigid or exhaustive. In practice, there might be:

- multiple survey objectives that conflict with one another, or
- survey objectives which conflict with survey resources.
Hence, it is usually necessary to compromise in selecting a design or often to combine designs.

6.6 CONTROLS

After a sample is selected, it is necessary to check the number of cases actually obtained against the number expected (as calculated by applying the sampling rate to the number of cases in the universe). Discrepancies may indicate that the sampling procedure was not properly carried out. For example, forgetting to sample from file drawers in use at the time of sampling, and thus omitting part of the population, would result in fewer cases than expected. Further checks on whether the sample shows any unusual features may also help us know whether the sampling was actually performed as planned.

6.7 USE OF CHECK DATA IN SAMPLING

Very frequently, when a sample has been selected for a study, sample data will be collected and tabulated for a set of basic items for which there are already available known population totals in addition to the items of special interest in the survey. Such known population totals are called "check data" or "independent information." If the sample results for the known items agree closely with the known population totals, it is sometimes claimed that this coincidence "validates" the sample and proves it will provide good results for other items.

Actually, this so-called "validation" does not demonstrate that we have a "good" sampling procedure, or that the sample will yield "good" estimates for the other items in the survey. It is only on the basis of a random method of selecting the sample that we are able to attach a sampling error to our statistics, and to evaluate the probability that the estimates will be within specified limits of the true value; therefore, we cannot rely exclusively on such "validation."

Nevertheless, there are three acceptable uses of check data:

(1) Available check data may be used in improving the method of sampling; for example, in providing a basis for stratification.
(This is the subject of the next two chapters.)

(2) It is possible to calculate the standard errors of the estimates made from the sample data. If the check data and sample estimates of the same items differ more than might reasonably be expected from the size of the calculated standard errors, this may indicate that the sampling procedures were not carried out properly, that the sampling frame has coverage errors, or that something else went wrong in the implementation of the survey. Further investigation is needed.

(3) Check data may be used in improving the method of making estimates from the sample; for example, by adjusting the sample estimate by the ratio of the true value of the check item to the sample estimate of this check item (using a ratio estimate). We will discuss this more fully in later chapters.

The above three applications of the use of check data (or independent information) are acceptable, since we can make statistical inferences when using them.

6.8 SAMPLING WEIGHTS IN SRS

Recall that the sampling weight of a sample unit is equal to the reciprocal of the probability of selection. In SRS, the probability of selection is (n / N). Therefore, the sampling weight is equal to:

Sampling Weight = 1 / (n / N) = N / n

6.9 SELF-WEIGHTING AND NON-SELF-WEIGHTING SAMPLES

A sample is self-weighting if every unit in the sample has the same probability of selection. By their very nature, SRS samples are self-weighting. However, in practice, most complex designs produce non-self-weighting samples. For instance, a higher sampling fraction is used for the stratum that contains large businesses (sometimes all of them are chosen); in demographic surveys (say, health) we may oversample special minority groups in order to obtain better estimates with smaller variances. In addition, almost any self-weighting design becomes non-self-weighting due to adjustments to the basic weights.

Exercises

6.1 You have a population of 185 persons. Select a systematic sample of 20 persons.
List the numbers assigned to them and describe the procedure you used in the selection.

6.2 Suppose that a city block contains 125 housing units. We wish to select a systematic sample of 10 housing units. Follow the steps we discussed in this chapter to accomplish this.

Chapter 7 STRATIFIED SAMPLING-BASIC THEORY
__________________________________________________________________________

7.1 DESCRIPTION OF THE STRATIFICATION PROCEDURE

In simple random sampling, we do not try to force the sample to be representative of different groups in the population. The tendency to be representative is inherent in the procedure itself and the sampling error can be reduced only by increasing the size of sample. However, if something is known in advance about a population, it may be possible to use this information in stratification and thus reduce the sampling error. The judgment of experts may be useful here.

Stratified random sampling is a method in which the elements of the population are divided into groups (strata), and a simple random sample is selected for each group, taking at least one element from each group (stratum). One element from each group is sufficient to estimate the mean, but two are needed to estimate its reliability; generally many more than two are needed to make the estimates sufficiently precise.

The process of establishing these groups is called stratification and the groups are called strata. The strata may reflect regions of a country, densely populated or sparsely populated areas, various ethnic or other groups. In stratification we group together elements which are similar, so that the population variance within stratum h is small; at the same time, it is desirable that the means of the several strata be as different as possible. The letter h will be used to identify the strata so that if L strata are created, h will go from 1 to L.

In stratified sampling, the probabilities of selection may be the same from group to group, or they may be different.
It is not necessary that all elements have the same chance of selection, but the chance of each must be known. Under stratified random sampling all the elements in a particular stratum have equal chances of being selected. While not every combination of elements is possible, all of the possible samples (that is, combinations of elements) that might be drawn have the same chance of occurring.

In stratified sampling, the selection of sampling units, the location and enumeration of the selected units, distribution and supervision of fieldwork and, in general, the whole administration of the survey is greatly simplified. The procedure, however, presupposes knowledge of the strata sizes, that is, the total number of sampling units in each stratum, as well as the availability of a frame for selecting a sample from each stratum.

The most important aspect of a good stratification is that it significantly lowers the sampling error of the estimates if the stratification variable is highly correlated with the variables of interest.

7.2 NOTATION

We use the same notation as for simple random sampling, except that there will be a subscript to indicate a particular stratum when we refer to information regarding this stratum. Thus, N will represent the total number of elements in the population, as before; but N1 will be the number in the first stratum, N2 will be the number in the second stratum, etc. Similarly, n will be the total sample size; n1 will be the size of the sample in the first stratum, n2 will be the size of the sample in the second stratum, etc. The subscript h denotes the stratum and i the unit within the stratum. As in the case of simple random sampling, capital letters refer to population values and lower case letters denote corresponding sample values. The notation given in the following table will be used.
Measurement                                      For Population                  For Sample                  Sample Estimate
Total number of elements                         N                               n                           --
Number of strata                                 L                               L                           --
Number of elements in the h th stratum           N_h                             n_h                         --
Total for a certain variable (characteristic)    Y                               y                           $\hat{Y}_{st}$
Total of the variable in the h th stratum        Y_h                             y_h                         $\hat{Y}_h$
Average over all strata (population mean)        $\bar{Y}$                       $\bar{y}$                   $\bar{y}_{st}$
Average for h th stratum (stratum mean)          $\bar{Y}_h$                     $\bar{y}_h$                 --
Proportion having attribute                      P                               p                           p_st
Proportion in the h th stratum                   P_h                             p_h                         --
Population variance                              S²                              --                          --
Population variance for the h th stratum         $S_h^2$                         $s_h^2$                     --
Variance of an estimated total                   $\sigma^2_{\hat{Y}_{st}}$       $s^2_{\hat{Y}_{st}}$        --
Variance of an estimated mean                    $\sigma^2_{\bar{y}_{st}}$       $s^2_{\bar{y}_{st}}$        --
Value of a specific unit                         $Y_{hi}$                        $y_{hi}$                    --

7.2.1 Illustration for a Whole Population

Suppose we have a universe of eight farms with known value of land and buildings as follows:

Farm   Value of land and buildings
1      $2,026
2       6,854
3       1,532
4       2,180
5       5,408
6       9,284
7       1,438
8       8,836

Let us compute the average (mean) and the standard deviation of these values. In terms of the notation above, we would have

N = 8     $\bar{Y}$ = $4,694.75     S = $3,326.04

Now let us arrange the farms into two strata, so that the groupings of values are as follows:

Stratum 1   Stratum 2
$1,438      $5,408
 1,532       6,854
 2,026       8,836
 2,180       9,284

If we compute the average and standard deviation of each group of four farms separately, we would have

Stratum 1                    Stratum 2
N1 = 4                       N2 = 4
$\bar{Y}_1$ = $1,794         $\bar{Y}_2$ = $7,595.50
S1 = $364.33                 S2 = $1,800.45

7.3 ESTIMATES FROM A STRATIFIED SAMPLE

The population mean can be expressed in terms of the stratum totals, as follows:

$$\bar{Y} = \frac{Y}{N} = \frac{1}{N}\sum_{h=1}^{L} Y_h \qquad (7.1)$$

where the population total $Y = \sum_{h=1}^{L} Y_h$. Since each $Y_h$ can be expressed as $N_h \bar{Y}_h$, we may write

$$\bar{Y} = \frac{1}{N}\sum_{h=1}^{L} N_h \bar{Y}_h \qquad (7.2)$$

Within each stratum, simple random sampling is used. We saw previously that for simple random sampling, $\bar{y}$ is an unbiased estimate of $\bar{Y}$. This suggests that for stratified sampling an estimate of the population mean can be obtained by substituting, for each stratum mean, the corresponding estimate from the sample.
That is, the mean of the sample elements from the first stratum gives an estimate of the true mean of the first stratum; the mean of the sample elements in the second stratum gives us an estimate of the true mean for the second stratum, etc. In symbols, therefore, the estimate of the population mean from a stratified sample is denoted by $\bar{y}_{st}$ (st for stratified) and is given by:

$$\bar{y}_{st} = \frac{1}{N}\sum_{h=1}^{L} N_h \bar{y}_h \qquad (7.3)$$

Another way of expressing the same formula is

$$\bar{y}_{st} = \frac{1}{N}\sum_{h=1}^{L} \frac{N_h}{n_h}\, y_h \qquad (7.4)$$

where y_h is the sample total for the hth stratum.

7.3.1 Illustration of estimate of mean

A stratified sample is drawn from a population of 1,000 farms to estimate average expenditure by farm operators for hired labor. There are three strata--the total number of farms in the first is 300; in the second, also 300; and in the third, 400. The selected samples have 30, 30, and 40 farms in the three strata respectively. The average expenditure for the 30 farms in the first stratum is $12.20; for the 30 farms in the second stratum, $25.60; and for the 40 farms in the third stratum, $48.70. For the sample estimate of the average expenditure for all farms in the population we would have

$$\bar{y}_{st} = \frac{(300)(12.20) + (300)(25.60) + (400)(48.70)}{1{,}000} = \frac{30{,}820}{1{,}000} = \$30.82$$

7.3.2 Estimate of total

As with simple random sampling, we make an estimate of the population total by multiplying the estimate of the mean by the total number of elements in the population:

$$\hat{Y}_{st} = N \bar{y}_{st} = \sum_{h=1}^{L} N_h \bar{y}_h \qquad (7.5)$$

7.3.3 Estimate of proportion

To estimate a proportion for the population, the procedure is similar to that for the mean because a proportion, $P_{st}$, is simply a special case of the mean when the only possible values of $Y_{hi}$ are 0 and 1. In this case, $\bar{y}_{st} = p_{st}$ for stratified random sampling.
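The true population proportion $P_{st}$ is given by

$$P_{st} = \frac{1}{N}\sum_{h=1}^{L} N_h P_h$$

and it is estimated by

$$p_{st} = \frac{1}{N}\sum_{h=1}^{L} N_h p_h \qquad (7.6)$$

7.4 SAMPLING ERROR OF A STRATIFIED SAMPLE

The sampling errors of the three types of estimates referred to above are computed by using equation (7.7) for the mean, equation (7.8) for the total, and equation (7.9) for a proportion. Each equation gives a variance; the sampling error is its square root.

$$\sigma^2_{\bar{y}_{st}} = \frac{1}{N^2}\sum_{h=1}^{L} N_h^2\left(1-\frac{n_h}{N_h}\right)\frac{S_h^2}{n_h} \qquad (7.7)$$

where

$$S_h^2 = \frac{1}{N_h-1}\sum_{i=1}^{N_h}\left(Y_{hi}-\bar{Y}_h\right)^2$$

$$\sigma^2_{\hat{Y}_{st}} = N^2\,\sigma^2_{\bar{y}_{st}} = \sum_{h=1}^{L} N_h^2\left(1-\frac{n_h}{N_h}\right)\frac{S_h^2}{n_h} \qquad (7.8)$$

$$\sigma^2_{p_{st}} = \frac{1}{N^2}\sum_{h=1}^{L} N_h^2\left(1-\frac{n_h}{N_h}\right)\frac{N_h}{N_h-1}\,\frac{P_h(1-P_h)}{n_h} \qquad (7.9)$$

The corresponding formulas for the estimated sampling error for each type of estimate are:

$$s^2_{\bar{y}_{st}} = \frac{1}{N^2}\sum_{h=1}^{L} N_h^2\left(1-\frac{n_h}{N_h}\right)\frac{s_h^2}{n_h} \qquad (7.10)$$

where the variance of the sample in stratum h is

$$s_h^2 = \frac{1}{n_h-1}\sum_{i=1}^{n_h}\left(y_{hi}-\bar{y}_h\right)^2$$

$$s^2_{\hat{Y}_{st}} = N^2\, s^2_{\bar{y}_{st}} \qquad (7.11)$$

$$s^2_{p_{st}} = \frac{1}{N^2}\sum_{h=1}^{L} N_h^2\left(1-\frac{n_h}{N_h}\right)\frac{p_h(1-p_h)}{n_h-1} \qquad (7.12)$$

Similar formulas can be derived for the coefficient of variation by dividing the square roots of the above expressions by the value of the item being estimated. Thus, for example:

$$cv(\bar{y}_{st}) = \frac{s_{\bar{y}_{st}}}{\bar{y}_{st}} \qquad (7.13)$$

The formulas for confidence intervals of the population mean and the population total are:

$$\bar{y}_{st} \pm t\, s_{\bar{y}_{st}} \qquad (7.14)$$

$$\hat{Y}_{st} \pm t\, s_{\hat{Y}_{st}} \qquad (7.15)$$

These formulas assume that $\bar{y}_{st}$ is normally distributed and that $s_{\bar{y}_{st}}$ is well determined, so that the multiplier t can be read from tables of the normal distribution (see Appendix I). If only a few degrees of freedom (less than 30) are provided by each stratum, the t-value should be taken from the tables of Student's t (see Appendix II) instead of the normal table.

7.4.1 Illustration

Let us apply equation (7.7) to the case of the eight farms in the illustration in section 7.2.1. Suppose we took a sample of four farms out of the eight--two from each stratum--and we have computed $\bar{y}_{st}$ by equation (7.3). What is the sampling error of $\bar{y}_{st}$? In the two strata, the values would be:

Stratum 1                      Stratum 2
N1 = 4                         N2 = 4
n1 = 2                         n2 = 2
S1 = 364.33                    S2 = 1,800.45
$S_1^2$ = 132,736.35           $S_2^2$ = 3,241,620.2

Applying equation (7.7):

$$\sigma^2_{\bar{y}_{st}} = \frac{1}{8^2}\left[4^2\left(1-\frac{2}{4}\right)\frac{132{,}736.35}{2} + 4^2\left(1-\frac{2}{4}\right)\frac{3{,}241{,}620.2}{2}\right] \approx 210{,}897, \qquad \sigma_{\bar{y}_{st}} \approx \$459.2$$

It is interesting to compare this sampling error with the corresponding sampling error of the mean for a simple random sample of four farms. For a simple random sample of four farms, we would have

$$\sigma_{\bar{y}} = \sqrt{\left(1-\frac{4}{8}\right)\frac{(3{,}326.04)^2}{4}} \approx \$1{,}175.9$$

In this example, the sampling error of the stratified sample is much smaller than that of the simple random sample, less than half.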
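In fact, it would require a sample of seven farms, using simple random sampling, to achieve the same reliability (that is, as small a sampling error) as we obtained with a stratified sample of the four farms.

7.4.2 Remarks

In actual practice, we usually do not know the true values $S_h^2$ and $P_h$. Instead, we substitute sample estimates of these values into equations (7.7), (7.8), and (7.9) to obtain equations (7.10), (7.11) and (7.12), respectively. To make such estimates from a single sample, we would need at least two elements from each stratum. (In the examples described above, we were able to compute the standard error for samples having only one element per stratum because we had information on all elements in the universe.)

To derive equation (7.7), we do the following:

$$\bar{y}_{st} = \frac{1}{N}\sum_{h=1}^{L} N_h \bar{y}_h \qquad (7.16)$$

Apply the variance operator to each side of equation (7.16). Because the samples are selected independently in the different strata, the variance of the sum is the sum of the variances:

$$Var(\bar{y}_{st}) = Var\left(\frac{1}{N}\sum_{h=1}^{L} N_h \bar{y}_h\right) \qquad (7.17)$$

$$= \frac{1}{N^2}\, Var\left(\sum_{h=1}^{L} N_h \bar{y}_h\right) \qquad (7.18)$$

$$= \frac{1}{N^2}\sum_{h=1}^{L} Var\left(N_h \bar{y}_h\right) \qquad (7.19)$$

$$= \frac{1}{N^2}\sum_{h=1}^{L} N_h^2\, Var(\bar{y}_h) \qquad (7.20)$$

But $Var(\bar{y}_h) = \left(1-\frac{n_h}{N_h}\right)\frac{S_h^2}{n_h}$ for h = 1, 2, 3, . . . , L. Therefore, we can write (7.20) as:

$$Var(\bar{y}_{st}) = \frac{1}{N^2}\sum_{h=1}^{L} N_h^2\left(1-\frac{n_h}{N_h}\right)\frac{S_h^2}{n_h} \qquad (7.21)$$

Equation (7.21) is equivalent to equation (7.7). We will now rewrite equation (7.21) in a different way to make some observations.

$$Var(\bar{y}_{st}) = \sum_{h=1}^{L}\left(\frac{N_h}{N}\right)^2\left(1-\frac{n_h}{N_h}\right)\frac{S_h^2}{n_h} \qquad (7.22)$$

which can also be written the following way:

$$Var(\bar{y}_{st}) = \sum_{h=1}^{L}\left(\frac{N_h}{N}\right)^2\frac{S_h^2}{n_h} - \frac{1}{N^2}\sum_{h=1}^{L} N_h S_h^2 \qquad (7.23)$$

From equation (7.22) we can see that if the fpc = 1, i.e., if $n_h/N_h$ is negligible in every stratum, then equation (7.22) becomes:

$$Var(\bar{y}_{st}) = \sum_{h=1}^{L}\left(\frac{N_h}{N}\right)^2\frac{S_h^2}{n_h} \qquad (7.24)$$

Equation (7.23) has two components. The first component is shown in equation (7.24) and it represents the variance of the mean when sampling is done with replacement, that is, when the fpc = 1. The second term in equation (7.23) represents the adjustment that one needs to make when sampling is done without replacement.

We can also see from equation (7.24) that the variance of the mean is directly proportional to the strata population variances. That is, the smaller the population variance in the strata, the smaller the variance of the mean. In other words, the more homogeneous the strata, the smaller the overall variance of the mean with stratified sampling.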
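The eight-farm illustration of sections 7.2.1 and 7.4.1 can be checked numerically. The sketch below assumes the usual variance formulas with the finite population correction: $(1 - n/N)S^2/n$ for a simple random sample and $\frac{1}{N^2}\sum_h N_h^2 (1 - n_h/N_h) S_h^2/n_h$ for a stratified sample. The function names are illustrative only.

```python
import math
from statistics import mean, variance  # variance() divides by (count - 1)

# Value of land and buildings for the eight farms of section 7.2.1
farms = [2026, 6854, 1532, 2180, 5408, 9284, 1438, 8836]
strata = {1: [1438, 1532, 2026, 2180], 2: [5408, 6854, 8836, 9284]}


def srs_variance_of_mean(values, n):
    """Variance of the mean of a simple random sample of n, with the fpc."""
    N = len(values)
    return (1 - n / N) * variance(values) / n


def stratified_variance_of_mean(strata, sample_sizes):
    """Variance of the stratified mean; strata hold ALL population values."""
    N = sum(len(v) for v in strata.values())
    total = 0.0
    for h, values in strata.items():
        N_h, n_h = len(values), sample_sizes[h]
        total += N_h ** 2 * (1 - n_h / N_h) * variance(values) / n_h
    return total / N ** 2


print(mean(farms), round(math.sqrt(variance(farms)), 2))      # population mean and S
for h, values in strata.items():
    print(h, mean(values), round(math.sqrt(variance(values)), 2))

se_strat = math.sqrt(stratified_variance_of_mean(strata, {1: 2, 2: 2}))
se_srs = math.sqrt(srs_variance_of_mean(farms, 4))
print(round(se_strat, 1), round(se_srs, 1))
```

The stratum means and standard deviations agree with the values quoted in section 7.2.1. The sampling error of the stratified mean comes out at roughly 459, against roughly 1,176 for a simple random sample of four farms, which is consistent with the statement that it is less than half.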
Exercises

7.1 Suppose you have a population of 12 persons whose hourly earnings are as follows:

Person   Hourly Wage
 1       0.85
 2       1.35
 3       0.60
 4       2.20
 5       1.80
 6       3.10
 7       0.90
 8       1.50
 9       1.75
10       0.75
11       2.40
12       2.10

a. What is the average (mean) hourly earnings for this group?
b. What is the sampling error of the mean for a sample of six persons selected as a simple random sample?
c. Stratify this population into three strata of equal size in the best way to estimate average earnings. List the persons in each stratum by their hourly earnings.
d. Select a sample of six persons--two from each stratum. Suppose from stratum I we obtain in sample the values (0.60) and (0.90); from stratum II we obtain the sample values (1.35) and (1.50); and from stratum III we get (3.10) and (2.10).
   (a) Estimate the average (mean) hourly earnings for this sample.
   (b) Obtain the sampling error of the estimated mean.

Chapter 8 STRATIFIED SAMPLING-ALLOCATION TO STRATA
______________________________________________________________________________

8.1 THE PROBLEM OF ALLOCATION

The definition of stratified sampling does not specify a particular size of sample in a stratum. The sample can be selected so as to have the same size in each stratum, or it can be distributed in some other way. As long as we select at least one element per stratum, the specification for a stratified sample is satisfied; and with two elements per stratum we can estimate both the mean and its error. Usually the total sample size is much larger than two elements per stratum. Hence, the question arises as to what criterion should be used in allocating the total sample among the strata.

Let us return to the earlier example of a population of eight farms in two strata. If we wish to select a sample of two farms to estimate the mean, we have no choice but to take one farm from each stratum. Suppose, however, that we wish to select four farms. Then we have a choice in the allocation of the sample.
Would it be better to select two farms from each stratum or take one farm from one stratum and three farms from the other?

There are two important criteria for determining how the sample should be distributed among the various strata. The first criterion is convenience; that is, choose a method which is easy to apply and simple to tabulate. This usually leads to the use of proportionate (or proportional allocation) stratified sampling. The second criterion is precision: choose a method which will provide the smallest sampling variance (or sampling error). This leads to the use of optimum allocation.

8.2 PROPORTIONATE STRATIFIED SAMPLING

It is very common in stratified sampling to select the same proportion of units in each stratum. With this method, to take a 10-percent sample of a given population, we would take a 10-percent sample from each stratum. Since the sampling rates in all strata are the same, the number of elements taken for the sample will vary from stratum to stratum, depending on the size of the stratum. Within each stratum, the sample size will be proportionate to the total population of the stratum. We can express this mathematically as follows:

$$\frac{n_h}{N_h} = \frac{n}{N}$$

or alternatively

$$n_h = n\,\frac{N_h}{N}$$

For the population characteristics that we are usually interested in (namely, Y and $\bar{Y}$) we can prepare estimates from a proportionate stratified sample as easily as from a simple random sample; in fact, by using the same formula

$$\bar{y}_{st} = \frac{1}{n}\sum_{i=1}^{n} y_i \qquad (8.1)$$

In this formula, the sum is for all sample elements without regard to strata; since (Nh/nh) is a constant, and equal to (N/n), equation (7.3) of chapter 7 reduces to this form. We also have

$$\hat{Y}_{st} = \frac{N}{n}\sum_{i=1}^{n} y_i \qquad (8.2)$$

where i in equations (8.1) and (8.2) refers to individual observations.

The simple weighting procedure makes proportionate sampling attractive since results are easy to tabulate. Different strata do not have to be tabulated separately. All of the sample data can be added together before application of any factors such as (1/n) or (N/n).
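A small numerical check makes the point concrete. The sketch below reuses the figures from the farm-expenditure illustration of section 7.3.1 (strata of 300, 300 and 400 farms, sampled at 10 percent) and compares the stratum-weighted estimate of equation (7.3) with the simple pooled mean of all sample elements. The function names are illustrative only.

```python
def proportional_allocation(N_h, n):
    """n_h = n * N_h / N for each stratum (may be fractional; round in practice)."""
    N = sum(N_h)
    return [n * Nh / N for Nh in N_h]


def stratified_mean(N_h, ybar_h):
    """Equation (7.3): stratum sample means weighted by stratum sizes."""
    N = sum(N_h)
    return sum(Nh * yb for Nh, yb in zip(N_h, ybar_h)) / N


N_h = [300, 300, 400]            # farms per stratum
ybar_h = [12.20, 25.60, 48.70]   # sample mean expenditure per stratum

n_h = proportional_allocation(N_h, 100)
print(n_h)

weighted = stratified_mean(N_h, ybar_h)

# Pooled mean: add up every sample value regardless of stratum, divide by n.
# The stratum sample totals are n_h * ybar_h.
pooled = sum(nh * yb for nh, yb in zip(n_h, ybar_h)) / sum(n_h)

print(round(weighted, 2), round(pooled, 2))
```

Both estimates equal $30.82. With unequal sampling rates the two would differ, and only the weighted form would remain a valid estimate of the population mean.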
A sample which has this feature is self-weighting. That is, in a self-weighting sample, every individual observation has the same probability of selection and, consequently, the same weight.

The true standard error of the mean estimated from a stratified sample is

$$\sigma_{\bar{y}_{st}} = \sqrt{\frac{1}{N^2}\sum_{h=1}^{L} N_h^2\left(1-\frac{n_h}{N_h}\right)\frac{S_h^2}{n_h}} \qquad (8.3)$$

When we substitute $n_h = n N_h / N$ in equation (8.3), this becomes, for a proportionate stratified sample,

$$\sigma_{\bar{y}_{st}} = \sqrt{\frac{N-n}{N\,n}\sum_{h=1}^{L}\frac{N_h}{N}\,S_h^2} \qquad (8.4)$$

and the standard error of the estimated total is

$$\sigma_{\hat{Y}_{st}} = N\,\sigma_{\bar{y}_{st}} \qquad (8.5)$$

Proportional allocation has many advantages:

1. In order to use this allocation procedure we don't need to know the stratum variances (as the methods we'll discuss later do).
2. Other methods require us to know the costs of sampling units in the different strata, but not this method.
3. The increase in precision from other more elaborate methods is not very large.
4. Efficient for national-level estimates.

We will see later on that when there is a very large variation in the stratum variances, the gain in precision obtained by other methods may outweigh the simplicity of proportional allocation. Nevertheless, as shown later, this method is widely used in applied sample design.

8.3 OPTIMUM ALLOCATION

Sometimes we have to conduct a survey with a fixed amount of money and we may be faced with the fact that the cost of sampling units in different strata differs widely. For instance, it is a well-known fact that sampling units in rural areas is generally more expensive than in urban areas, because the distances are longer and sometimes sampling units are more difficult to find.

The term optimum allocation refers to the optimum (the most efficient) way of allocating the total sample (n) to the different strata. The formula is given by:

$$n_h = n\,\frac{N_h S_h/\sqrt{c_h}}{\sum_{h=1}^{L} N_h S_h/\sqrt{c_h}}$$

where c_h is the cost of sampling one unit in stratum h.
The above formula is obtained by finding the values of n_h that will minimize the variance of $\bar{y}_{st}$ subject to the linear constraint that the total cost $\sum_{h} c_h n_h$ is fixed.

When the costs of sampling in the different strata are the same, the optimum allocation formula is called Neyman allocation, after Jerzy Neyman (1934), who investigated mathematically the question of what distribution of the sample among strata would give the smallest possible sampling error. He found that the answer was to let the sampling rate in each stratum vary according to the amount of variability in the stratum--in other words, to make the sampling rate in a given stratum proportional to the standard deviation in that stratum. The number of elements to be sampled from any stratum, then, would depend not only on the total number of elements in that stratum, but also on the standard deviation of the characteristic to be measured.

For Neyman allocation, the number to be selected within a stratum is given by the following formula:

$$n_h = n\,\frac{N_h S_h}{\sum_{h=1}^{L} N_h S_h} \qquad (8.6)$$

With Neyman allocation, the formula for the variance of the mean (after using (8.6) in formula (8.3)) reduces to

$$\sigma^2_{\bar{y}_{st}} = \frac{\left(\sum_{h=1}^{L} N_h S_h\right)^2}{n\,N^2} - \frac{\sum_{h=1}^{L} N_h S_h^2}{N^2} \qquad (8.7)$$

The second term on the right represents the use of the fpc.

As before, the standard error of the total is given by the following formula:

$$\sigma_{\hat{Y}_{st}} = N\,\sigma_{\bar{y}_{st}} \qquad (8.8)$$

For this type of allocation, it is necessary to know the values of S_h in the universe. If these are not known in advance, then they may be estimated within each stratum, by using the methods described in section 5.4. Note that in formula (8.6), when the S_h are all equal, Neyman allocation becomes proportionate allocation.

8.3.1 Illustration

Let us compare the standard errors arising from proportionate and optimum allocation in the same survey. In 1942, a census of lumber production was taken in the United States. In 1943, the survey was to be repeated, but on a sample basis. Before selecting the sample, mills were grouped into strata, on the basis of their 1942 production; an analysis of the data produced the information presented in Table 8.1.
Table 8.1 BASIC DATA FOR DETERMINING OPTIMUM ALLOCATION
(Production figures and standard deviations given in thousands of board feet)

Stratum   1942 Annual Production   Number of mills (Nh)   Average production in stratum   Standard deviation for 1943 (Sh)*
1         5,000 and over              538                 11,029.7                        9,000
2         1,000 to 4,999            4,756                  1,779.6                        1,200
3         Under 1,000              30,964                    203.8                          300
Total                              36,258                    571.2                        **1,684

*Estimated from 1942 data.
**For unstratified sampling.

Now let us select a sample of 1,000 mills. The first question to consider is how to determine the sample size in each stratum, under either proportionate sampling or optimum allocation sampling. The second question to consider is the resulting reliability of the two methods. Let us consider first the matter of the sample size, then the matter of reliability.

8.3.2 Sample Size in Each Stratum

For proportionate allocation, since the sampling rate is 1,000 out of 36,258, this rate is used in each stratum. The sample sizes, therefore, would be:

n1 = (1,000/36,258) x 538 = 15
n2 = (1,000/36,258) x 4,756 = 131
n3 = (1,000/36,258) x 30,964 = 854.

For optimum allocation, the sample size in each stratum would be determined by the following table.

Table 8.2 SAMPLE SIZE FOR OPTIMUM ALLOCATION

Stratum   Number of mills (Nh)   Standard Deviation (Sh)   NhSh         NhSh / Total NhSh   Number in sample (nh)*   Sampling rate
1            538                 9,000                      4,842,000   0.244                 244                    1/2
2          4,756                 1,200                      5,707,200   0.288                 288                    1/16
3         30,964                   300                      9,289,200   0.468                 468                    1/66
Total     36,258                                           19,838,400   1.000               1,000

*nh = 1,000 x (NhSh / Total NhSh)

8.3.3 Standard Errors

What are the standard errors for these two sample designs? For proportionate allocation, the standard error of the estimate of the mean is given by equation (8.4). For the survey of lumber production,

$$\sum_{h}\frac{N_h}{N}S_h^2 = \frac{538(9{,}000)^2 + 4{,}756(1{,}200)^2 + 30{,}964(300)^2}{36{,}258} \approx 1{,}467{,}632$$

and

$$\sigma_{\bar{y}_{st}} = \sqrt{\frac{36{,}258 - 1{,}000}{(36{,}258)(1{,}000)}\,(1{,}467{,}632)} \approx 37.8$$

thousand board feet.

For optimum allocation, the corresponding standard error is the square root of the variance given by equation (8.7):

$$\sigma_{\bar{y}_{st}} = \sqrt{\frac{(19{,}838{,}400)^2}{(1{,}000)(36{,}258)^2} - \frac{1{,}467{,}632}{36{,}258}} = \sqrt{299.37 - 40.48} \approx 16.1$$

thousand board feet.

To complete the analysis, one may compare these results with those obtained if we had not stratified the mills, but had taken a simple random sample of 1,000 mills from the universe.
In this case, the standard error is given by:

$$ S(\bar{y}) = \sqrt{\frac{N-n}{N}\cdot\frac{S^2}{n}} = \sqrt{0.9724 \times \frac{(1{,}684)^2}{1{,}000}} \approx 52.5 \text{ thousand board feet.} $$

8.4 COMPARISON OF SAMPLING ERRORS WITH DIFFERENT SAMPLING METHODS

Examining the results of the sample designs above, we see that optimum allocation gave us a standard error of 16.1 thousand board feet, considerably smaller than that under proportionate sampling, which was 37.8; we see also that the sampling error under proportionate sampling was smaller than that under simple random sampling, which was 52.5. Putting the results another way, it would require a proportionate sample more than 5 times as large as an optimum allocation sample to achieve the same reliability. Simple random sampling would require a sample more than 10 times as large. The efficiency of optimum allocation results from the fact that it provides for more intensive sampling in strata having large standard deviations, which can be expected to contribute more heavily to the total sampling error.

The example in section 8.3 above illustrates a general result which can be demonstrated mathematically. The sampling errors of the three types of designs are approximately related in the following way (if the sampling rates are small enough so that the finite correction factors can be ignored):

$$ S^2_{ran} \approx S^2_{prop} + \frac{1}{n}\sum_h \frac{N_h}{N}\left(\bar{Y}_h - \bar{Y}\right)^2 \tag{8.9} $$

$$ S^2_{prop} \approx S^2_{opt} + \frac{1}{n}\sum_h \frac{N_h}{N}\left(S_h - \bar{S}\right)^2 \tag{8.10} $$

where $\bar{S} = \sum_h N_h S_h / N$ is a weighted average of the values of $S_h$; $\bar{Y}_h$ and $\bar{Y}$ are the stratum and overall means; and $S^2_{ran}$, $S^2_{opt}$ and $S^2_{prop}$ are, respectively, the variances of the estimated means based on simple random sampling, optimum and proportionate sampling.

An examination of these formulas shows that sampling errors obtained with optimum allocation will be at least as small, and usually smaller, than those obtained with proportionate stratified sampling. Furthermore, the errors obtained with either of these methods will be at least as small, and generally smaller, than those obtained with simple random sampling. (There are a few rare cases, which almost never occur in practice, in which this is not true.
When the sample is very small and the stratification is completely ineffective, neither proportionate sampling nor optimum allocation may show a gain over simple random sampling. For all practical purposes, this possibility can be ignored.)

Consider the conditions under which important differences result from the three methods. When we compare proportionate stratified sampling with simple random sampling, it can be shown that the gain in reliability depends on the amount by which the means of the strata vary; the greater the variation between the means (in other words, the greater the differences among the strata), the greater the reduction in the standard error arising from the use of proportionate sampling. On the other hand, if the variance between stratum means is fairly small compared to the total variance, not much will be gained by stratification.

As a result, stratification is usually less important in dealing with proportions than with measured items (or with aggregates or quantities). For example, it would be of much greater help in trying to estimate the average expenditure of farmers for hired labor than in estimating the proportion of farmers who hire labor. Even for measured items the gains would be slight unless the strata are established so that the differences between the means are sizable (as was the case in the example of lumber mills). For example, in conducting a survey to measure personal income, it would probably not pay to establish separate strata for different professional groups (for example, doctors, lawyers, etc.). It probably would be useful, however, to set up separate strata for broader groups: laborers, businessmen, professionals, etc.

Since proportionate sampling is nearly always better than simple random sampling, stratification is recommended whenever it can be accomplished with little additional work.
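The comparison of the three designs can be reproduced numerically. The following Python sketch recomputes the standard errors for the lumber-mill example from the figures in Table 8.1, assuming the usual stratified variance formula with the fpc included. The function names are illustrative and are not part of the manual.

```python
from math import sqrt

# Table 8.1, in thousands of board feet
N_h = [538, 4756, 30964]          # number of mills per stratum
S_h = [9000.0, 1200.0, 300.0]     # stratum standard deviations
S_all = 1684.0                    # standard deviation for unstratified sampling
n = 1000                          # total sample size
N = sum(N_h)


def se_stratified_mean(N_h, S_h, n_h):
    """Standard error of the stratified mean, fpc included."""
    N = sum(N_h)
    var = sum((Nh / N) ** 2 * Sh ** 2 / nh * (1 - nh / Nh)
              for Nh, Sh, nh in zip(N_h, S_h, n_h))
    return sqrt(var)


def proportionate(N_h, n):
    """n_h proportional to N_h."""
    N = sum(N_h)
    return [n * Nh / N for Nh in N_h]


def neyman(N_h, S_h, n):
    """n_h proportional to N_h * S_h, as in formula (8.6)."""
    total = sum(Nh * Sh for Nh, Sh in zip(N_h, S_h))
    return [n * Nh * Sh / total for Nh, Sh in zip(N_h, S_h)]


se_srs = sqrt(S_all ** 2 / n * (1 - n / N))
se_prop = se_stratified_mean(N_h, S_h, proportionate(N_h, n))
se_opt = se_stratified_mean(N_h, S_h, neyman(N_h, S_h, n))

print(round(se_srs, 1), round(se_prop, 1), round(se_opt, 1))
# Ratio of variances = factor by which the sample would have to grow
print(round((se_prop / se_opt) ** 2, 1), round((se_srs / se_opt) ** 2, 1))
```

The sketch uses unrounded stratum sample sizes, so the results agree with the text to the one decimal place quoted there; the variance ratios correspond to the "more than 5 times" and "about 10 times" statements.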
Comparing optimum allocation with proportionate allocation, we see that if the standard deviations in all strata are the same, the two methods are identical. The greater the differences between the standard deviations in the strata, the greater the reduction in sampling error to be expected from optimum allocation. Unless the range among the standard deviations is greater than 2 or 3 to 1, the gains of optimum allocation are so small that they are probably not worth the extra complications in tabulation. With larger variations in standard deviations, the gains are appreciable and optimum allocation is advisable. In the example of lumber mills, the standard deviation for stratum 1 was 30 times as large as that for stratum 3.

We need to know the $S_h$ for each stratum either (a) to apply optimum allocation or (b) to estimate the errors of proportionate stratified samples. Of course, in practice, we never really know each $S_h$ and must estimate it. Two questions arise:

(a) How is the accuracy of the sample affected by the errors introduced by estimating $S_h$ instead of knowing the true value?
(b) What methods can be used to estimate these quantities?

In answer to the first question, if our estimates of the standard deviation are fairly reasonable (for example, accurate to within 30% or 40%) we will obtain almost all of the gains of optimum allocation. The reason for this is that the sampling error does not increase very rapidly as the allocation departs from the optimum within fairly broad limits. (It should be noted that poor guesses of the values of $S_h$ do not introduce any biases in the result; they only increase the sampling errors.) However, if the estimates of $S_h$ are very unreliable, the "optimum allocation" may have a larger variance than proportionate allocation. In this case, it is safer to use proportionate allocation.

In regard to the second question, we can use the methods for estimating the standard deviations described previously (Section 5.4 of Chapter 5).
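The answer to the first question can be checked numerically. The Python sketch below allocates the lumber-mill sample using deliberately wrong guesses of $S_h$ and then evaluates the standard error with the values from Table 8.1 treated as true. The guessed values (30% too low in stratum 1, 40% too high in stratum 2) are hypothetical and chosen only for illustration, as are the function names.

```python
from math import sqrt

N_h = [538, 4756, 30964]
S_true = [9000.0, 1200.0, 300.0]    # Table 8.1 values, treated as the truth
S_guess = [6300.0, 1680.0, 300.0]   # hypothetical guesses: -30%, +40%, exact
n = 1000


def allocate(N_h, S_h, n):
    """Neyman allocation (8.6) computed from whatever S_h values are supplied."""
    total = sum(Nh * Sh for Nh, Sh in zip(N_h, S_h))
    return [n * Nh * Sh / total for Nh, Sh in zip(N_h, S_h)]


def se_mean(N_h, S_h, n_h):
    """Standard error of the stratified mean, fpc included, using the true S_h."""
    N = sum(N_h)
    return sqrt(sum((Nh / N) ** 2 * Sh ** 2 / nh * (1 - nh / Nh)
                    for Nh, Sh, nh in zip(N_h, S_h, n_h)))


N = sum(N_h)
se_opt = se_mean(N_h, S_true, allocate(N_h, S_true, n))
se_guess = se_mean(N_h, S_true, allocate(N_h, S_guess, n))
se_prop = se_mean(N_h, S_true, [n * Nh / N for Nh in N_h])

print(round(se_opt, 2), round(se_guess, 2), round(se_prop, 2))
```

Under these assumed guesses the standard error rises by only a few percent over the true optimum and remains far below the proportionate figure, which is the behavior the text describes.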
One additional method that is sometimes used is to assume that the standard deviations for the strata are proportional to the average values within the strata; that is, assume the same relative standard deviation in each stratum. (Note that for optimum allocation, it is not necessary to know the absolute values of the standard deviations; it is only necessary to know their values relative to each other.) This assumption will frequently give results reasonably close to the optimum. In the case of the lumber mills discussed previously, taking $n_h$ proportional to $N_h$ times the average production in the stratum (Table 8.1) would give us a sample of roughly 287 mills in stratum 1, 409 in stratum 2 and 305 in stratum 3.

It can be seen that this allocation is much closer to optimum allocation than is proportionate allocation. In fact, if the standard error of this allocation is computed, it turns out to be 17.3. This is not quite as good as the 16.1 for optimum allocation, but it is far superior to the 37.8 obtained with proportionate sampling.

8.5 OPTIMUM ALLOCATION WITH VARIABLE COSTS

The discussion of optimum allocation thus far has been in terms of getting the most reliable results for a given total sample size. It frequently happens that the costs of obtaining information vary substantially from stratum to stratum. To give an example, let us suppose that families have been stratified by urban and rural residence; furthermore, suppose that the cost of conducting a rural interview is five times as great as that of an urban interview. It would be wise to concentrate more of the sample in the cheaper stratum. Another example would be a sample survey of business firms in which we mail questionnaires to small companies and visit large ones personally, so that there are large differences in unit costs.

A more general approach than the one described above is to consider the optimum allocation for a fixed cost, rather than for a fixed sample size.
In other words, we would like to allocate the sample among strata in such a way as to achieve the lowest standard error with a fixed budget. For this we need a cost function, which is a mathematical formulation expressing the cost of taking the survey in terms of the sample sizes, $n_h$.

Suppose the average cost for a single questionnaire in the hth stratum is called $C_h$. Thus $C_1$ is the cost per questionnaire in the first stratum, $C_2$ is the cost in the second stratum, etc. $C_h$ represents the total cost of a questionnaire in the hth stratum, including the cost of interviewing, coding, data entry, etc. (There may also be an overhead cost for the survey which does not depend on the size of the sample, but it is not necessary to consider this in the cost function.) The total cost of the survey which can be affected by the sample size is

$$ C = \sum_h C_h n_h $$

For a fixed cost C, the optimum allocation of the sample turns out to be

$$ n_h = n\,\frac{N_h S_h / \sqrt{C_h}}{\sum_h N_h S_h / \sqrt{C_h}} \tag{8.11} $$

Note: To use this formula, n must first be calculated. Note that $n_h$ is a function of the $C_h$'s, $S_h$'s, and $N_h$'s. See Sample Survey Methods and Theory, Volume I: Methods and Applications, by Hansen, M.H., Hurwitz, W.N., and Madow, W.G. New York, Wiley and Sons, 1953, p. 221.

That is, $n_h$ is directly proportional to $N_h$ and to $S_h$, and inversely proportional to $\sqrt{C_h}$.

Formula (8.11) leads to several rules. In a given stratum, we would take a larger sample under the following conditions:

(1) If the stratum is larger than the average stratum.
(2) If the stratum is more variable internally than the average stratum.
(3) If the cost of collection and processing is lower than in the average stratum.

In regard to the third point, the unit cost in the stratum ($C_h$) enters into the formula in the form of a square root. This tends to reduce the effect of the differences in unit cost. Unless the costs vary by a factor of at least 2 to 1, using the formula above will give results not very much different from the simpler Neyman allocation given in equation (8.6).
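The allocation rule (8.11) can be sketched in a few lines of Python. The example below borrows the two-stratum figures from the illustration in section 8.5.1 and treats that illustration's field cost of $1,751 as a fixed budget. The two sample-size functions use the standard textbook expressions for fixed cost and for a specified variance, which are assumed here to correspond to the manual's formulas for n; all function names are hypothetical.

```python
from math import sqrt


def _sums(N_h, S_h, C_h):
    """Return sum(N_h S_h / sqrt(C_h)) and sum(N_h S_h sqrt(C_h))."""
    a = sum(N * S / sqrt(c) for N, S, c in zip(N_h, S_h, C_h))
    b = sum(N * S * sqrt(c) for N, S, c in zip(N_h, S_h, C_h))
    return a, b


def optimum_allocation(N_h, S_h, C_h, n):
    """Formula (8.11): n_h proportional to N_h S_h / sqrt(C_h)."""
    a, _ = _sums(N_h, S_h, C_h)
    return [n * (N * S / sqrt(c)) / a for N, S, c in zip(N_h, S_h, C_h)]


def n_for_fixed_cost(N_h, S_h, C_h, budget):
    """Total sample size that exhausts the budget under allocation (8.11)."""
    a, b = _sums(N_h, S_h, C_h)
    return budget * a / b


def n_for_fixed_variance(N_h, S_h, C_h, variance):
    """Total sample size giving the specified variance of the mean (fpc term kept)."""
    a, b = _sums(N_h, S_h, C_h)
    N = sum(N_h)
    return a * b / (N ** 2 * variance + sum(Nh * Sh ** 2 for Nh, Sh in zip(N_h, S_h)))


# Figures from the illustration in section 8.5.1
N_h, S_h, C_h = [1056, 1584], [10.0, 20.0], [4.0, 9.0]

n_cost = n_for_fixed_cost(N_h, S_h, C_h, budget=1751.0)
alloc = optimum_allocation(N_h, S_h, C_h, n_cost)
spent = sum(c * nh for c, nh in zip(C_h, alloc))
n_var = n_for_fixed_variance(N_h, S_h, C_h, variance=1.0)

print(round(n_cost, 1), [round(v) for v in alloc], round(spent, 2), round(n_var))
```

Because stratum 2 is both larger and more variable, it still receives two thirds of the sample even though each questionnaire there costs more than twice as much; the square root of the cost dampens the effect of the cost difference.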
In equation (8.11), we do not yet know the value of n. If cost is fixed, substitute the value of $n_h$ from (8.11) in $C = \sum_h C_h n_h$ and solve for n. This gives

$$ n = \frac{C \sum_h N_h S_h / \sqrt{C_h}}{\sum_h N_h S_h \sqrt{C_h}} \tag{8.12} $$

If, however, an estimate with a specified variance $B^2$ is required, n is given by

$$ n = \frac{\left(\sum_h N_h S_h \sqrt{C_h}\right)\left(\sum_h N_h S_h / \sqrt{C_h}\right)}{N^2 B^2 + \sum_h N_h S_h^2} \tag{8.13} $$

Note that in case $C_h = c$, that is, if the cost per unit is the same in all strata, then the cost becomes $C = c\,n$ and also equation (8.11) reduces to equation (8.6), which is the formula for Neyman allocation. That is, optimum allocation for fixed cost reduces to optimum allocation for fixed sample size.

8.5.1 Illustration

Suppose a sampler proposes to take a stratified random sample. He expects that his field costs will be of the form $C = \sum_h C_h n_h$. His advance estimates of relevant quantities for the two strata are as follows:

Stratum 1     Stratum 2
N1 = 1,056    N2 = 1,584
S1 = 10       S2 = 20
C1 = $4       C2 = $9

(a) Find the sample size required under optimum allocation to make $S(\bar{y}_{st}) = 1$. Ignore the fpc.
(b) Determine the sample size for each stratum (i.e., the allocation of the total sample size n to each of the two strata).
(c) How much will the total field cost be (excluding overhead costs)?

Solution of (a)

For optimum allocation, the formula for n is given by (8.13), where $B^2 = E^2/k^2$. Recall that $E = k\,S(\bar{y}_{st})$. In our example, E = 2 since k is taken to be equal to 2 (it should be 1.96 to be exact), so $B^2 = 1$. Therefore,

$$ n = \frac{(116{,}160)(15{,}840)}{(2{,}640)^2 (1) + 739{,}200} = \frac{1{,}839{,}974{,}400}{7{,}708{,}800} \approx 239 $$

This value retains the term $\sum_h N_h S_h^2$ in the denominator; dropping that term, as when the fpc is ignored entirely, would give n = 264.

Solution of (b)

The sample size in each stratum is given by equation (8.11). For the sample size in the first stratum:

$$ n_1 = 239 \times \frac{5{,}280}{15{,}840} \approx 80 $$

Similarly, n2 = 159.

Solution of (c)

The total field cost is given by:

C = C1 n1 + C2 n2 = 4 (80) + 9 (159) = $1,751

8.6 OPTIMUM ALLOCATION FOR SEVERAL ITEMS

The formula for the optimum allocation of the sample (equation 8.6 or 8.11) is necessarily computed for a single characteristic or variable, Y. If it is desired to obtain the most favorable sample allocation for several characteristics, some kind of compromise must be made. Some alternatives are:

(1) Determine the most important item (or group of highly correlated items) and allocate the sample to get the best estimate for this item.
(2) Follow the procedure in (1) and increase the size of the sample in some strata to provide adequate coverage of other important items.
(3) Set up a function which assigns a weight to each item according to its importance; use this function in the allocation to prevent poor sample estimates for the most important characteristics.

Optimum allocation is most effective for characteristics which vary widely for the individual units, such as amount of personal income, number of board feet produced by a sawmill, kilos of maize harvested on a farm, etc. In sampling for attributes, however, such as the proportion of the population in a class (for example, in the income class $1,000 - $1,999), proportionate sampling may be the best allocation. It has the added advantage of being self-weighting.

8.7 STRATIFIED SAMPLING FOR PROPORTIONS

Before concluding this chapter, some comments will be made on the problem of sample allocation when the object is to estimate a population proportion P. From equation (7.9) of Chapter 7, we have for stratified random sampling, with $Q_h = 1 - P_h$,

$$ S^2(p_{st}) = \frac{1}{N^2}\sum_h \frac{N_h^2 (N_h - n_h)}{N_h - 1}\cdot\frac{P_h Q_h}{n_h} \tag{8.14} $$

with proportional allocation,

$$ S^2(p_{st}) = \frac{N-n}{N}\cdot\frac{1}{nN}\sum_h \frac{N_h^2}{N_h - 1} P_h Q_h \tag{8.15} $$

Ignoring the fpc,

$$ S^2(p_{st}) = \sum_h \left(\frac{N_h}{N}\right)^2 \frac{P_h Q_h}{n_h} \tag{8.16} $$

For the sample estimate of the variance, substitute $p_h q_h/(n_h - 1)$ for the unknown $P_h Q_h/n_h$ in any of the formulas above.

If the optimal allocation can be used, $n_h$ will be chosen proportional to $N_h \sqrt{P_h Q_h}$. This allocation will differ substantially from proportional allocation only if the quantities $\sqrt{P_h Q_h}$ differ considerably from stratum to stratum. For example, let the $P_h$ lie between 0.3 and 0.7, in which case $\sqrt{P_h Q_h}$ will lie between 0.46 and 0.50. In this situation the optimum allocation will not be preferred to proportional allocation when the simplicity of the computations involved is another factor to be taken into account.

We can choose $n_h$ in order to minimize the variance.

Minimum variance for fixed total sample size:

$$ n_h \propto N_h \sqrt{P_h Q_h} $$

where $\propto$ represents "proportional to". Thus,

$$ n_h = n\,\frac{N_h \sqrt{P_h Q_h}}{\sum_h N_h \sqrt{P_h Q_h}} \tag{8.17} $$

Minimum variance for fixed cost:
$$ n_h \propto N_h \sqrt{\frac{P_h Q_h}{C_h}} \tag{8.18} $$

where cost $= \sum_h C_h n_h$. The value of n is found by substituting $S_h = \sqrt{P_h Q_h}$ in equation (8.12) or (8.13).

8.7.1 Illustration

In a firm, 62% of the employees are skilled or unskilled males, 31% are clerical females, and 7% are supervisory. A sample of 400 employees is taken from a total of 7,000 employees. Based on the sample, the firm wishes to estimate the proportion that uses certain recreational facilities. Rough guesses are that the facilities are used by 40 to 50% of the males, 20 to 30% of the females, and 5 to 10% of the supervisors. How would you allocate the sample among the three groups? What would the standard error of the estimated proportion $p_{st}$ be? Ignore the fpc.

We have N = 7,000 and n = 400, with N1 = 4,340, N2 = 2,170 and N3 = 490. We guess P1 = 45%, P2 = 25% and P3 = 7.5% as a compromise. Using equation (8.17), we can allocate the total sample size (n = 400) to the different strata, as follows:

$$ n_1 = 400 \times \frac{4{,}340\sqrt{(0.45)(0.55)}}{4{,}340\sqrt{(0.45)(0.55)} + 2{,}170\sqrt{(0.25)(0.75)} + 490\sqrt{(0.075)(0.925)}} = 400 \times \frac{2{,}159.1}{3{,}227.8} \approx 268 $$

Similarly, n2 = 116 and n3 = 16.

The standard error is given by equation (8.16):

$$ S(p_{st}) = \sqrt{\frac{(0.62)^2 (0.2475)}{268} + \frac{(0.31)^2 (0.1875)}{116} + \frac{(0.07)^2 (0.0694)}{16}} \approx 0.023 $$

8.8 DETERMINATION OF SAMPLE SIZE n

In simple random sampling, we saw that the determination of n depended on the sampling variance of the estimator. In a similar way, for stratified sampling, we need to know the formulas for the sampling variances of the different methods of allocation in order to determine n for each one of these methods. Let's summarize the methods of allocation.

1. Equal samples from each stratum.
2. Proportionate allocation.
3. Optimum allocation: fixed budget, varying sampling costs among strata.
4. Neyman allocation: fixed sample size, equal sampling costs among strata.

We also saw that the stratum sample sizes $n_h$ for these methods of allocation were given by (with L the number of strata):

$$ n_h = \frac{n}{L} \tag{8.19, equal samples} $$

$$ n_h = n\,\frac{N_h}{N} \tag{8.20, proportionate} $$

$$ n_h = n\,\frac{N_h S_h/\sqrt{C_h}}{\sum_h N_h S_h/\sqrt{C_h}} \tag{8.21, optimum} $$

$$ n_h = n\,\frac{N_h S_h}{\sum_h N_h S_h} \tag{8.22, Neyman} $$

To determine the sample size we need to know the variances of these methods. So, we start with the formula for the variance of a mean when using stratified random sampling.
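Recall that the formula is given by:

$$ S^2(\bar{y}_{st}) = \frac{1}{N^2}\sum_h \frac{N_h^2 S_h^2}{n_h} - \frac{1}{N^2}\sum_h N_h S_h^2 \tag{8.23} $$

Now we substitute the different values of $n_h$ into formula (8.23). After we do this, we get, for equal samples, proportionate, optimum and Neyman allocation respectively:

$$ S^2(\bar{y}_{st}) = \frac{L}{n N^2}\sum_h N_h^2 S_h^2 - \frac{1}{N^2}\sum_h N_h S_h^2 \tag{8.24} $$

$$ S^2(\bar{y}_{st}) = \frac{1}{n N}\sum_h N_h S_h^2 - \frac{1}{N^2}\sum_h N_h S_h^2 \tag{8.25} $$

$$ S^2(\bar{y}_{st}) = \frac{1}{n N^2}\left(\sum_h N_h S_h \sqrt{C_h}\right)\left(\sum_h \frac{N_h S_h}{\sqrt{C_h}}\right) - \frac{1}{N^2}\sum_h N_h S_h^2 \tag{8.26} $$

$$ S^2(\bar{y}_{st}) = \frac{1}{n N^2}\left(\sum_h N_h S_h\right)^2 - \frac{1}{N^2}\sum_h N_h S_h^2 \tag{8.27} $$

Now, let's see how to determine the sample size n to estimate the mean with an error of estimation E. The sample size is directly related to the error we are willing to tolerate (or the precision we are required to obtain) in our estimates. As before, we define the error the following way:

Error of estimation = E = k S($\bar{y}_{st}$)

where k is the level of reliability. So, given the precision E that we need to obtain and the level of reliability k, we can write:

$$ B^2 = \frac{E^2}{k^2} = S^2(\bar{y}_{st}) \tag{8.28} $$

We know that as n increases, the variance of the estimate becomes smaller. Therefore, we need to find the sample size n that will give us a variance equal to $B^2$. Let's try to solve for n in equation (8.24), that is, when we have equal samples.

$$ B^2 = \frac{L}{n N^2}\sum_h N_h^2 S_h^2 - \frac{1}{N^2}\sum_h N_h S_h^2 \tag{8.29} $$

Multiply each side of equation (8.29) by $N^2$ and leave the term which contains n on one side of the equation. After we do this, we obtain:

$$ N^2 B^2 + \sum_h N_h S_h^2 = \frac{L}{n}\sum_h N_h^2 S_h^2 \tag{8.30} $$

When we solve for n, we obtain:

$$ n = \frac{L \sum_h N_h^2 S_h^2}{N^2 B^2 + \sum_h N_h S_h^2} \tag{8.31} $$

Now, when ($n_h/N_h$) is very small (negligible), the fpc = 1 and we may omit from the denominator of equation (8.31) the term $\sum_h N_h S_h^2$.

Applying a similar procedure to equations (8.25), (8.26), and (8.27), we obtain the sample size n given by the following formulas:

$$ n = \frac{N \sum_h N_h S_h^2}{N^2 B^2 + \sum_h N_h S_h^2} \tag{8.32} $$

$$ n = \frac{\left(\sum_h N_h S_h \sqrt{C_h}\right)\left(\sum_h N_h S_h/\sqrt{C_h}\right)}{N^2 B^2 + \sum_h N_h S_h^2} \tag{8.33} $$

$$ n = \frac{\left(\sum_h N_h S_h\right)^2}{N^2 B^2 + \sum_h N_h S_h^2} \tag{8.34} $$

As before, when the fpc = 1, the denominator in equations (8.32), (8.33) and (8.34) only contains the term $N^2 B^2$.

Another important point to mention is that all the formulas for n have been given in terms of the stratum population variances ($S_h^2$). In practice, we don't know these values and they have to be estimated by means of a sample or from other sources.

Stratified Random Sampling

1. A chain of department stores is interested in estimating the proportion of accounts receivable that are delinquent. The chain consists of four stores. To reduce the cost of sampling, stratified random sampling is used with each store as a stratum.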
Since no information on population proportions is available before sampling, proportional allocation is used. From the table given below, estimate P, the proportion of delinquent accounts for the chain, find its sampling error and calculate the coefficient of variation of the estimate.

Stratum   Number of accounts receivable   Sample size   Sample proportion of delinquent accounts
I         N1 = 65                         n1 = 14
II        N2 = 42                         n2 = 9
III       N3 = 93                         n3 = 21
IV        N4 = 25                         n4 = 6

Answer: Proportion = .3004; Sampling Error = 0.057975451

2. A corporation desires to estimate the total number of man-hours lost, for a given month, because of accidents among all employees. Since laborers, technicians, and administrators have different accident rates, it is decided to use stratified random sampling with each group forming a separate stratum. Data from previous years suggest the following variances for the number of man-hours lost per employee in the three groups and current data give the following stratum sizes:

I (Laborers)          N1 = 132
II (Technicians)      N2 = 92
III (Administrators)  N3 = 27

Determine the Neyman allocation for a sample of n = 30 employees.

Answer: n1 = 18; n2 = 10; n3 = 2

3. For Exercise 2, estimate the total number of man-hours lost during the given month and place a bound on the error of estimation. Use the following data obtained from sampling 18 laborers, 10 technicians, and 2 administrators:

I (Laborers) 8 0 6 7 9 18 24 16 0 4 5 2
II (Technicians) 0 32 16 4 8 0 4 0 8 3 1 5 24 12 2 8
III (Administrators) 1 8

Answer: Total = 1903.90; Error = 676.80

4. A zoning commission is formed to estimate the average appraised value of houses in a residential suburb of a city. It is convenient to use the two voting districts in the suburb as strata because separate lists of dwellings are available for each district.
From the data given below, estimate the average appraised value for all houses in the suburb and place a bound on the error of estimation (note that proportional allocation was used):

Stratum I    Stratum II
N1 = 10      N2 = 168
n1 = 20      n2 = 30

Answer: Mean = 13208.63; Error = 560.485

5. A corporation wishes to obtain information on the effectiveness of a business machine. A number of division heads will be interviewed by telephone and asked to rate the equipment on a numerical scale. The divisions are located in North America, Europe, and Asia. Hence, stratified sampling is used. The costs are larger for interviewing division heads located outside of North America. The following costs per interview, approximate variances of the ratings, and Ni's have been established:

Stratum I (North America)   Stratum II (Europe)   Stratum III (Asia)
c1 = $9                     c2 = $25              c3 = $36
N1 = 112                    N2 = 68               N3 = 39

The corporation wants to estimate the average rating with a specified bound on the error of estimation. Choose the sample size, n, which achieves this bound and find the appropriate allocation.

Answer: n = 27; n1 = 16; n2 = 7; n3 = 4

6. A school desires to estimate the average score that would be obtained on a reading comprehension exam for students in the sixth grade. The school has students divided into three tracks, with the fast learners in track I and the slow learners in track III. It was decided to stratify on tracks since this method should reduce variability of test scores. The sixth grade contains 55 students in track I, 80 in track II, and 65 in track III. A stratified random sample of 50 students is proportionally allocated and yields simple random samples of n1 = 14, n2 = 20, and n3 = 16 from tracks I, II, and III, respectively.
The test is administered to the sample of students with the following results:

Track I: 80 68 72 85 90 62 61 92 85 87 91 81 79 83
Track II: 85 48 53 65 49 72 53 68 71 59 82 75 73 78 69 81 59 52 61 42
Track III: 42 36 65 43 53 61 42 39 32 31 29 19 14 31 30 32

Estimate the average score for the sixth grade, and place a bound on the error of estimation.

Answer: Mean = 59.99; Error = 3.032

7. Suppose the average test score for the class in Exercise 6 is to be estimated again at the end of the school year. The costs of sampling are equal in all strata, but the variances differ. Find the optimum (Neyman) allocation of a sample of size 50 using the data in Exercise 6 to approximate the variances.

Answer: n1 = 11; n2 = 21; n3 = 18

8. Using the data in Exercise 6, find the sample size required to estimate the average score with a bound of 4 points on the error of estimation. Use proportional allocation.

Answer: n = 33

9. Repeat Exercise 8 using Neyman allocation. Compare the result with the answer to Exercise 8.

Answer: n = 32

10. A forester wants to estimate the total number of farm-acres planted in trees for a state. Since the number of acres of trees varies considerably with the size of the farm, it is decided to stratify on farm sizes. The 240 farms in the state are placed in one of four categories according to size. A stratified random sample of 40 farms, selected using proportional allocation, yields the following results on number of acres planted in trees:

Stratum I, 0-200 acres (N1 = 86, n1 = 14): 97, 67, 42, 125, 25, 92, 105, 86, 27, 43, 45, 59, 53, 21
Stratum II, 200-400 acres (N2 = 72, n2 = 12): 125, 155, 67, 96, 256, 47, 310, 236, 220, 352, 142, 190
Stratum III, 400-600 acres (N3 = 52, n3 = 9): 142, 256, 310, 440, 495, 510, 320, 396, 196
Stratum IV, 600+ acres (N4 = 30, n4 = 5): 167, 655, 220, 540, 780

Estimate the total number of acres of trees on farms in the state, and place a bound on the error of estimation.

Answer: Total = 50505.60; Error = 8663.124

11.
The study of Exercise 10 is to be made yearly with the bound on the error of estimation 500 acres. Find an approximate sample size to achieve this bound if Neyman allocation is to be used. Use the data in Exercise 10. Answer: n = 156 12. A psychologist working with a group of mentally retarded adults desires to estimate their average reaction time to a certain stimulus. He feels that men and women probably will show a difference in reaction times so he wants to stratify on sex. The group of 96 people contains 43 men. In previous studies of this type it has been observed that the times range from 5 to 20 seconds for men and 3 to 14 seconds for women. The costs of sampling are the same for both strata. Using optimum allocation find the approximate sample size necessary to estimate the average reaction time for the group to within 1 second. Answer: n = 29 13. A county government is interested in expanding the facilities of a day-care center for mentally retarded children. The expansion would increase the cost of enrolling a child in the center. A sample survey will be conducted to estimate the proportion of families with retarded children that would make use of the expanded facilities. The families are divided into those who use the existing facilities and those who do not. Some families live in the city in which the center is located and some live in the surrounding suburban and rural areas. Thus, stratified random sampling is used with users in the city, users in the surrounding county, nonusers in the city, and nonusers in the county forming strata 1, 2, 3, and 4, respectively. Approximately 90% of the present users and 50% of the present nonusers would use the expanded facilities. The cost of obtaining an observation from a user is $4.00 and from a nonuser is $8.00. The difference in cost is due to the fact that nonusers are difficult to locate. Existing records give N1 = 97; N2 = 43; N3 = 145; N4 = 68. 
Find the approximate sample size and allocation necessary to estimate the population proportion with a bound of 0.05 on the error of estimation.

Answer: n = 158; n1 = 39; n2 = 17; n3 = 69; n4 = 33

14. The survey in Exercise 13 was conducted and yields the following proportion of families who would use the new facilities:

p1 = .87, p2 = .93, p3 = .60, p4 = .53

Estimate the population proportion, P, and place a bound on the error of estimation. Was the desired bound achieved?

Answer: Proportion = .701; Error = .0503

15. Suppose in Exercise 13 the total cost of sampling is fixed at $400. Choose the sample size and allocation which minimizes the variance of the estimator, pst, for this fixed cost.

Answer: n = 62; n1 = 17; n2 = 6; n3 = 26; n4 = 13

16. The following data show the stratification of all the farms in a county by farm size and the average acres of corn per farm in each stratum.

Farm size (acres)   Number of farms (Nh)   Average corn acres   Standard deviation (Sh)   Sh squared   NhSh        Nh x Sh squared
0-40                        394                    5.4                  8.3                  68.89       3,270.20      27,142.66
41-80                       461                   16.3                 13.3                 176.89       6,131.30      81,546.29
81-120                      391                   24.3                 15.1                 228.01       5,904.10      89,151.91
121-160                     334                   34.5                 19.8                 392.04       6,613.20     130,941.36
161-                        430                   52.0                 28.6                 817.96      12,298.00     351,722.80
Total or mean             2,010                   26.3                 17.0                 289.00      34,216.80     680,505.02

a. For a sample of 100 farms, allocate the sample size to each stratum under:
(i) Proportional allocation
(ii) Optimum allocation
(iii) Equal allocation

b. For a sample of 100 farms, compute the sampling error of the estimated total for
(i) a simple random sample
(ii) proportional allocation
(iii) Neyman allocation

c. On the basis of this analysis, which of the three methods of allocating the sample would you recommend?

17. It is desired to estimate the total value of farm products for a population of 5,900 farms.
Means and variances are available from a past census on the value of farm products classified by farm size and tenure of the operator:

Size and tenure      Number of farms (Nh)   Average value of products   Variance       Standard deviation (Sh)   NhSh             Nh x Sh squared
All farms                  5,900                   3,500                 97,000,000         9,848.86

Size of farm
Under 10 acres               590                   1,200                 18,000,000         4,242.64           2,503,158.01
10 to 49 acres             1,600                   1,500                 15,000,000         3,872.98           6,196,773.35
50 to 99 acres             1,150                   2,200                 18,000,000         4,242.64           4,879,036.79
100 to 179 acres           1,200                   3,600                 35,000,000         5,916.08           7,099,295.74
180 to 259 acres             490                   5,500                 70,000,000         8,366.60           4,099,634.13
260 to 999 acres             650                   6,200                200,000,000        14,142.14           9,192,388.16
1,000+                       220                  18,000                400,000,000        20,000.00           4,400,000.00
Total                                                                                                         38,370,286.18    349,620,000,000

Tenure of operator
Full owner                 3,300                   2,600                 35,000,000         5,916.08          19,523,063.28
Part owner                   660                   6,900                110,000,000        10,488.09           6,922,138.40
Manager                       50                  18,000                510,000,000        22,583.18           1,129,158.98
Tenant                     1,890                   3,500                 40,000,000         6,324.56          11,953,409.56
Total                                                                                                         39,527,770.22    289,200,000,000

a. Compute the standard error of the total value of products from a proportionate stratified sample of 300 farms for each of the two methods of stratification (by size and by tenure of the operator).
b. Which method of stratification is more efficient for a proportionate sample?
c. Compute the standard error of the estimate of the total value of products, using a simple random sample of 300 farms.
d. For both methods of stratification, use the Neyman allocation for a sample of 300 farms, and compute:
(i) The number of sample farms in each stratum
(ii) The standard error of the estimate of the total value of products.
e. On the basis of this analysis, which of the four methods of allocating the sample would you recommend?
f. Assume that the sample was stratified by tenure and allocated by the optimum method.
Assume also that the following means by strata were obtained:

Tenure        Mean value of products
Full owner          2,900
Part owner          6,400
Manager            20,000
Tenant              4,000

Estimate the mean value of products for the population of 5,900 farms.
g. Describe how you would calculate the standard error of the mean computed in (f) above after the survey results are available.

18. With three strata, the values of the Nh, Sh, and ch are as follows:

Stratum   Nh      Sh   ch
1           860    5    2
2           640    4    3
3         1,230    6    5

a. Find the sample size in each stratum for a sample of size 200 under an optimum allocation.
b. How much will the total field cost be?

19. Using the list of 600 households residing in 30 villages (Appendix IV), select a SRS-WOR of 20 households, and on the basis of the data on the size of these 20 sample households, do the following:

a. Determine the number of households for each zone and then select a sample of size nh (h = 1, 2, 3) in each of the three zones. Use proportional allocation.
b. Estimate for each of the 3 zones separately:
(i) the total number of persons and its sampling error.
(ii) the average household (HH) size and its sampling error.
c. Estimate for the entire population:
(i) the total number of persons and its sampling error.
(ii) the average HH size and its sampling error.
(iii) the coefficients of variation (CVs) for both number of persons and average HH size.
d. Compare the population estimates and standard errors obtained from exercise 19 with those obtained from SRS-WOR in chapter 5.

CHAPTER 9 RATIO ESTIMATES
_______________________________________________________________________________________________

9.1 REASONS FOR CONSIDERING USE OF RATIO ESTIMATES

In earlier chapters we dealt with the problem of how to design the most efficient sample (from the point of view of minimizing the standard error) using as much relevant information as we can obtain about the population.
We have seen how to use information for stratification with either proportionate sampling or optimum allocation, how to take unit costs into account, and how to choose between different kinds of sampling units. We have seen how to use whatever knowledge we have of costs and of the variances of different methods of sampling, in order to produce the maximum amount of information with the resources we have available.

All of this analysis has been in terms of fairly simple estimates of means and totals, in which the estimates were prepared by using only the sample data, the total number of units (N) in the population and the probabilities of selection. Thus, for simple random sampling,

$$ \hat{Y} = \frac{N}{n}\sum_{i=1}^{n} y_i $$

For stratified sampling,

$$ \hat{Y} = \sum_h \frac{N_h}{n_h}\sum_{i=1}^{n_h} y_{hi} $$

There are similar formulas for cluster sampling, or for estimation of proportions.

There are, however, more complex methods of estimating these statistics, which under certain circumstances can result in very large reductions in the standard errors. Moreover, there are other types of statistics which we wish to measure, such as ratios of two characteristics, change over time of a single characteristic, etc. For example, we may obtain information on wages and salary payments and on number of hours worked, but we may be more interested in estimating the average hourly earnings than total wages and salaries or total hours worked. From surveys covering two different periods of time, we may be more interested in finding out whether total wages have gone up or down than in measuring the level at any one time. The analysis of the standard errors of estimated ratios also helps with the problem of producing more efficient estimates of means and totals.

We shall investigate the simplest and most commonly used method of improving the reliability of an estimated mean or total, by the use of a special estimating technique which produces a "ratio estimate."
A number of other very powerful tools are useful in particular situations; for example, difference estimates and regression estimates, double sampling (in which the final sample is selected from a previously selected larger sample that provides information for improving the final selection or the estimation procedure), and special methods for the estimation of time series. However, in this chapter we will discuss only ratio estimates.

9.2 RATIO ESTIMATES OF AGGREGATES

Ratio estimation is the most commonly used of the more complex estimation techniques available to the statistician. It is also the easiest to apply. It is appropriate whenever the units of the population possess two characteristics that are positively correlated; the higher the correlation, the greater the gain from using this technique. The simplest kind of ratio estimator, of the form given by equation (9.1), is an estimate of Y, the population aggregate (see footnote 1):

$$\hat{Y}_R = \frac{\hat{Y}}{\hat{X}}\,X \qquad (9.1)$$

Here, $\hat{Y}$ and $\hat{X}$ are the ordinary estimates of the aggregates of two characteristics Y and X; the aggregate X must be known in order to estimate the aggregate Y. To compute $\hat{Y}_R$ it is not necessary to compute $\hat{Y}$ and $\hat{X}$ since, for a self-weighting sample,

$$\frac{\hat{Y}}{\hat{X}} = \frac{\sum y_i}{\sum x_i}$$

However, the formula in (9.1) is useful in deriving the variance of $\hat{Y}_R$.

1. The ratio estimator of a mean (as an estimate of $\bar{Y}$) is obtained by dividing $\hat{Y}_R$ by N; it has the same coefficient of variation as the estimate $\hat{Y}_R$.

Ratio estimates of aggregates are ordinarily applied in the three situations described in sections 9.2.1 to 9.2.3 below.

9.2.1 Ratio to Same or Related Characteristic at an Earlier Time Period

X is the same type of characteristic as Y, but X refers to an earlier time period during which a complete census was taken. For example, we may have taken a full census of manufacturers in one year, and wish to take a sample survey the following year. Suppose we wish to estimate the total value of shipments.
For each manufacturing establishment in the sample, we obtain not only yi, the value of shipments in the survey year, but also xi, the value during the preceding census year. Then $\hat{Y}$ and $\hat{X}$ would be estimates from the sample of total shipments for the two years, obtained by the methods discussed earlier. X is the total value of shipments tabulated from the full census. In this application, the survey is actually used to measure the rate of change between the two years, using the identical sample of establishments. The rate of change is then multiplied by the census total for the previous year.

9.2.2 Ratio of Two Related Characteristics at the Same Time Period

Y and X are two different characteristics for the same time period, which are known to be positively correlated. The true value of the aggregate X is known. For example, for the ith farm in a sample, xi may be the total hectares in the farm, and yi the payments for farm labor; the total hectares in all farms, X, is known from another source. If, in general, the larger farms pay more total wages for farm labor than the smaller ones, the ratio estimate can drastically reduce the sampling error. In this application, the survey is used to measure a rate (such as the average payment per hectare) which is multiplied by the known number of hectares.

9.2.3 Ratio of a Subset to the Total

The characteristic Y is a subset of X, varying roughly in proportion to X. For example, xi may be total acres in the ith farm in the sample, and yi the acres planted to a particular crop on that farm. Another application is the case in which X is the total number of units of analysis and Y is the number of these having a particular attribute. For example, yi might be the number of persons in the labor force in the ith cluster; xi is the total number of persons in this cluster (see footnote 2); and X is the known total number of persons in the population.
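The arithmetic common to the three applications above can be sketched in a few lines of Python. This is only an illustration for a self-weighting sample: the function name and the farm figures are hypothetical and are not taken from any survey.

```python
def ratio_estimate_total(y, x, x_total):
    """Ratio estimate of the aggregate Y for a self-weighting sample.

    Computes (sum of y_i / sum of x_i) * X, where X is the known
    population total of the auxiliary characteristic.
    """
    if len(y) != len(x) or not y:
        raise ValueError("y and x must be non-empty and of equal length")
    sum_x = sum(x)
    if sum_x == 0:
        raise ValueError("the sample total of x must not be zero")
    return sum(y) / sum_x * x_total


# Hypothetical sample of four farms (illustrative values only).
hectares = [10.0, 20.0, 30.0, 40.0]        # x_i
payments = [120.0, 210.0, 330.0, 440.0]    # y_i
known_total_hectares = 5000.0              # X, assumed known from another source

estimate = ratio_estimate_total(payments, hectares, known_total_hectares)
print(estimate)
```

The sample supplies only the rate (here, payments per hectare); the level comes from the known total X.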
In these cases, the survey is used to measure a ratio which is then multiplied by the population total (X) for the characteristic in the denominator of the ratio.

2. In cluster sampling, the estimate of the total number of units of analysis, $\hat{X}$, will be a random variable, which is usually not exactly equal to the true figure X. Hence, the proportion of units having the attribute must be treated as a ratio of random variables.

9.3 VARIANCE AND BIAS OF A RATIO ESTIMATE

In examining the ratio estimate $\hat{Y}_R = (\hat{Y}/\hat{X})\,X$, where $\hat{Y}$ and $\hat{X}$ are the ordinary estimates of the two aggregates, it is clear that X is not derived from the sample. The sampling error in the estimate is, therefore, dependent on the sampling error of the ratio, with X having only the effect of a constant multiplier. Therefore, an analysis of the sampling error of $\hat{Y}_R$ is closely related to that of the ratio $\hat{R} = \hat{Y}/\hat{X}$ as an estimate of $R = Y/X$.

The mathematical form of the distribution of the ratio of two random variables from sample to sample is much more complicated than that of the simpler estimates discussed earlier. It involves the relationship of two variables, both of which have sampling errors. Hence, more care is required in deciding when to use such ratios. The following facts about the variance of ratios and ratio estimates will indicate when to use a ratio estimator to estimate a mean or an aggregate. They also tell us what error to expect when using the estimate.

9.3.1 Variance of Ratios and Ratio Estimates

The variance of an estimated ratio is approximately

$$\sigma^2_{\hat{R}} \approx \frac{1}{X^2}\left(\sigma^2_{\hat{Y}} + R^2\sigma^2_{\hat{X}} - 2R\rho\,\sigma_{\hat{Y}}\sigma_{\hat{X}}\right) \qquad (9.2)$$

where R is the population ratio (a ratio of aggregates), $R = Y/X$. Similarly, the variance of the ratio estimate of a total, $\hat{Y}_R$, is

$$\sigma^2_{\hat{Y}_R} = X^2\sigma^2_{\hat{R}} \approx \sigma^2_{\hat{Y}} + R^2\sigma^2_{\hat{X}} - 2R\rho\,\sigma_{\hat{Y}}\sigma_{\hat{X}} \qquad (9.3)$$

For simple random sampling with sampling fraction f = n/N, the alternative form of this equation is

$$\sigma^2_{\hat{Y}_R} \approx N^2\,\frac{1-f}{n}\,\frac{\sum_{i=1}^{N}(Y_i - RX_i)^2}{N-1} \qquad (9.3a)$$

and is estimated by

$$s^2_{\hat{Y}_R} = N^2\,\frac{1-f}{n}\,\frac{\sum_{i=1}^{n}(y_i - \hat{R}x_i)^2}{n-1} \qquad (9.3b)$$

Equations (9.2) and (9.3) are somewhat simpler if expressed in terms of the coefficient of variation, CV. The square of the coefficient of variation (that is, the rel-variance) of $\hat{Y}_R$ is the same as that of $\hat{R}$ and can be expressed as

$$CV^2(\hat{Y}_R) = CV^2(\hat{R}) \approx CV^2(\hat{Y}) + CV^2(\hat{X}) - 2\rho\,CV(\hat{Y})\,CV(\hat{X}) \qquad (9.4)$$

In the above formulas, ρ is the coefficient of correlation between the variables Y and X.
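It represents the correlation of Y and X, not for the elementary units of analysis but for the units used for sampling. For example, if Y and X represent the incomes of persons in two different years, but the sample is a cluster sample, the correlation coefficient ρ will be the correlation between the values Yi and Xi, where Yi is the sum of the incomes for all persons in the ith cluster in the year of estimation and Xi is the corresponding sum in the base year.

Frequently, $\rho\,\sigma_{\hat{Y}}\sigma_{\hat{X}}$ is referred to as the sampling covariance between the ordinary estimates $\hat{Y}$ and $\hat{X}$, and the symbol $\sigma_{\hat{Y}\hat{X}}$ is used for it. It can be calculated exactly as the variance, but with the cross product replacing the square wherever it occurs. Thus, for simple random sampling with sampling fraction f = n/N we have

$$\sigma_{\hat{Y}\hat{X}} = \rho\,\sigma_{\hat{Y}}\sigma_{\hat{X}} = N^2\,\frac{1-f}{n}\,S_{YX} \qquad (9.5)$$

where

$$S_{YX} = \frac{\sum_{i=1}^{N}(Y_i - \bar{Y})(X_i - \bar{X})}{N-1} \qquad (9.6)$$

and an estimate of $S_{YX}$ can be made from the sample by using

$$s_{yx} = \frac{\sum_{i=1}^{n}(y_i - \bar{y})(x_i - \bar{x})}{n-1} \qquad (9.7)$$

The corresponding estimate of ρ, designated by ρ', is obtained by putting sample values of $\sigma_{\hat{Y}}$, $\sigma_{\hat{X}}$ and $S_{YX}$ in place of the population values in equation (9.5), and solving for ρ, which then becomes ρ'. The ρ' may also be computed directly from

$$\rho' = \frac{\sum (y_i - \bar{y})(x_i - \bar{x})}{\sqrt{\sum (y_i - \bar{y})^2\,\sum (x_i - \bar{x})^2}} \qquad (9.8)$$

For a stratified sample, with estimates of totals given by $\hat{Y} = \sum_h N_h\bar{y}_h$ and $\hat{X} = \sum_h N_h\bar{x}_h$,

$$\sigma_{\hat{Y}\hat{X}} = \sum_h N_h^2\,\frac{1-f_h}{n_h}\,S_{hYX} \qquad (9.9)$$

where the $S_{hYX}$ are within-strata covariances and are computed in exactly the same way, but are restricted to the values within each stratum.

9.3.2 Gains with a Ratio Estimate

If we examine equation (9.4), the formula for the rel-variance of the ratio estimate $\hat{Y}_R$ of a total,

$$CV^2(\hat{Y}_R) \approx CV^2(\hat{Y}) + CV^2(\hat{X}) - 2\rho\,CV(\hat{Y})\,CV(\hat{X}) \qquad (9.4)$$

we see that CV² of the ratio estimate can be expressed as CV² of the simpler estimate $\hat{Y}$, plus the term $CV^2(\hat{X})$, minus the term $2\rho\,CV(\hat{Y})\,CV(\hat{X})$. Whether we gain or lose by the use of a ratio estimate, as compared with the simpler estimate, depends on whether $CV^2(\hat{X}) - 2\rho\,CV(\hat{Y})\,CV(\hat{X})$ is smaller or greater than zero. Another way of expressing this is the following:

(1) If $\rho > \dfrac{CV(\hat{X})}{2\,CV(\hat{Y})}$, a ratio estimate is more efficient.
(2) If $\rho < \dfrac{CV(\hat{X})}{2\,CV(\hat{Y})}$, a ratio estimate is less efficient.
(3) If $\rho = \dfrac{CV(\hat{X})}{2\,CV(\hat{Y})}$, both estimates have the same standard error.

9.3.2.1 High Correlation.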
To see the implication of these facts in some common situations, consider the example of a census of manufacturers which was conducted in one year, followed by a sample the next year. Let yi and xi represent the values of shipments for the same sample firm in two consecutive years, and let $\hat{Y}$ and $\hat{X}$ be the ordinary estimates of the two totals and $\hat{Y}_R$ the ratio estimate. In this case $CV(\hat{Y})$ and $CV(\hat{X})$ are nearly the same, and $CV(\hat{X})/CV(\hat{Y})$ is approximately 1. Furthermore, there will be a very high correlation between Y and X, probably about 0.90 or 0.95. Consequently, a ratio estimate will result in a substantial gain in accuracy. The amount of the gain can be found as follows: if $CV(\hat{X}) = CV(\hat{Y})$, equation (9.4) becomes

$$CV^2(\hat{Y}_R) = 2(1-\rho)\,CV^2(\hat{Y})$$

and if the correlation coefficient ρ = .90, we have

$$CV^2(\hat{Y}_R) = 0.2\,CV^2(\hat{Y})$$

In other words, the use of a ratio estimate achieves an 80 percent reduction in variance. If ρ = .95, $CV^2(\hat{Y}_R)$ becomes equal to $0.1\,CV^2(\hat{Y})$ and the reduction is 90 percent. Looking at the result in another way, the ratio estimate is as effective as using a sample 5 times (or 10 times) as large.

9.3.2.2 Low Correlation.

Consider now the situation described in section 9.2.3 in which Y is a subset of X. In such cases, the correlation is likely to be quite low, unless the proportion Y/X is fairly large--for example, greater than ½. In practice, if Y/X is less than about 20 percent, a ratio estimate may increase the sampling error although, generally, not much. If Y/X is greater than 40 or 50 percent, a ratio estimate will usually improve the efficiency; the closer to 100 percent, the more the improvement. Between 20 and 40 percent, the differences between the two types of estimates will be small. Thus, for example, in a labor force survey, the use of ratio estimates probably provides an important improvement in the estimate of the number of employed (which comprises a fairly high proportion of the adult population) but probably results in a slight increase in the standard error of the estimate of unemployed.

9.3.3 Bias of the Ratio Estimate

The ratio estimate is a biased estimate.
This can easily be demonstrated by constructing a small population with values Yi and Xi for each element, taking all possible samples of two or three elements, and computing the ratio $\sum y_i / \sum x_i$ for each sample. It will be seen that the average of the ratios is not equal to the true ratio. However, the bias tends to be negligible for moderately large samples. In most practical applications, the bias is so small compared with the advantage gained in reducing the sampling error that the ratio estimate is preferred over the unbiased estimate.

9.3.4 Consistent Estimates

A ratio estimate, although biased, is a consistent estimate. This means that, if we use a large enough sample, we can be sure that the estimate will be as close as we like to the true value. Not only does the standard error decrease with increasing sample size, but the bias is also reduced.

9.3.5 Confidence Limits

For reasonably large samples, ratio estimates are approximately normally distributed (for the kinds of populations dealt with in practice). Consequently, if we can compute the standard error of the ratio estimate, we can construct the same type of confidence limits for the estimated ratio and the ratio estimate of a total as for the simple estimates of a mean and a total; that is, we can say that we have a 68-percent chance that a range around the estimate of plus and minus one standard error will cover the true figure, a 95-percent chance that a range of plus and minus two standard errors will cover the true figure, etc.

9.3.6 Minimum Sample Size Required

Sections 9.3.3 to 9.3.5 above refer to the fact that moderately large samples are needed to make the bias negligible, and to provide a reasonably normal distribution of sample estimates. When is the sample large enough? The following working rule has been suggested: If the sample size exceeds 30 and if the coefficients of variation of the ordinary estimates of the X total and the Y total are both less than 10 percent, then the bias is negligible and we can assume that the theory for the normal distribution applies.
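The demonstration suggested in section 9.3.3 can be carried out with a short Python sketch. The four-element population below is invented purely for illustration; exact fractions are used so that the small bias is not obscured by rounding.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical population of four elements, each an (X_i, Y_i) pair.
population = [(1, 2), (2, 3), (3, 7), (4, 8)]


def sample_ratio(sample):
    """Ratio of the y total to the x total over the given elements."""
    sum_x = sum(x for x, _ in sample)
    sum_y = sum(y for _, y in sample)
    return Fraction(sum_y, sum_x)


# True population ratio R = Y / X.
true_ratio = sample_ratio(population)

# Ratios from all possible samples of two elements drawn without replacement.
ratios = [sample_ratio(s) for s in combinations(population, 2)]

# Each sample is equally likely, so the expected value is the plain average.
expected_ratio = sum(ratios) / len(ratios)
bias = expected_ratio - true_ratio

print("true ratio:", true_ratio)
print("average of sample ratios:", expected_ratio)
print("bias:", bias)
```

For this population the average of the six sample ratios differs slightly from the true ratio, which is the bias described above; repeating the enumeration with samples of three elements shows how the bias changes with sample size.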
The first condition does not mean that a ratio estimate is necessarily better than a simple unbiased estimate whenever n > 30; it means this size of sample is required before the formulas for sampling error have the usual meaning in terms of confidence intervals.

9.3.7 Formula for Bias

An approximation to the bias of an estimate $\hat{R} = \hat{Y}/\hat{X}$ of a ratio of two variables is

$$\mathrm{Bias}(\hat{R}) \approx R\left[CV^2(\hat{X}) - \rho\,CV(\hat{X})\,CV(\hat{Y})\right]$$

where ρ (the coefficient of correlation) and R (the population ratio) are defined as in section 9.3.1. For the ratio estimate of a total, $\hat{Y}_R = \hat{R}X$, the bias is

$$\mathrm{Bias}(\hat{Y}_R) \approx Y\left[CV^2(\hat{X}) - \rho\,CV(\hat{X})\,CV(\hat{Y})\right]$$

Even with low values of ρ this will be small compared with the standard error of $\hat{Y}_R$, provided only that the sample is reasonably large so that $CV(\hat{X})$ is small.

These bias formulas are presented for analytical purposes. They are never used to adjust estimates. In situations where the bias would be expected to be significantly large, we would either increase the sample size or use a different method of estimation.

9.3.8 Danger in Use of Ratio Estimate

If ratio estimates are applied separately for a large number of subgroups of the population, with a small sample in each subgroup, the biases in the subgroups may accumulate and become too large to ignore. For example, suppose a relatively small sample of persons is classified by separate age-sex groups--300 persons divided into 5-year age groups by sex. There would be about 30 such groups. Suppose we know the true total population in each of these 30 groups. For any statistic we are interested in, we could compute a separate ratio estimate for the persons in each of the 30 groups, and then get a final estimate by adding the 30 results. The average size of sample in each group would be 10. Since there would be only a small sample in each of the age groups for which a ratio estimate would be formed, the accumulation of 30 different ratio estimates could result in a serious bias. In such a case, the use of ratio estimation group-by-group is not recommended.

9.3.9 Illustration

Suppose that a complete census of the value of manufacturing shipments was taken in 1981.
The following table shows the value of shipments in 1981 and in 1982 for each establishment in a simple random sample of 10 establishments drawn from a population of 30. The problem is to estimate the total value of shipments in 1982. The true 1981 total, X, is assumed to be known. Its value is $19.5 billion.

Value of shipments in 1981 (xi)    Value of shipments in 1982 (yi)
            0.3                                0.1
            1.1                                0.6
            0.5                                0.8
            0.4                                0.6
            1.0                                1.0
            0.7                                0.8
            0.2                                0.9
            0.3                                0.8
            2.4                                2.7
            0.1                                0.2

We have N = 30, n = 10.

Compute the estimate of the total, its variance, the coefficient of variation of the estimate, and the confidence interval for Y by using (a) simple random sampling and (b) ratio estimates.

(a) Simple random sampling
(1)
(2)
(3)
(4)
(5) 95% confidence interval for Y is

(b) Ratio estimates
(1)
Using equation (9.3b),
(2)
(3)
(4)
(5) A 95% confidence interval for Y is ($16.46, $30.90) billion

Formulas for Ratio Estimation Variances

Population Ratio R:

$$R = \frac{T_y}{T_x} = \frac{\mu_y}{\mu_x}, \qquad \text{estimated by } r = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} x_i}$$

Estimated Variance of r:

$$\hat{V}(r) = \left(\frac{N-n}{nN}\right)\frac{1}{\mu_x^2}\,\frac{\sum_{i=1}^{n}(y_i - r x_i)^2}{n-1}$$

Ratio Estimator of the Population Total Ty:

$$\hat{T}_y = r\,T_x$$

Estimated Variance of $\hat{T}_y$:

$$\hat{V}(\hat{T}_y) = T_x^2\,\hat{V}(r)$$

Ratio Estimator of a Population Mean μy:

$$\hat{\mu}_y = r\,\mu_x$$

Estimated Variance of $\hat{\mu}_y$:

$$\hat{V}(\hat{\mu}_y) = \mu_x^2\,\hat{V}(r)$$

Ratio Estimation

1. A forester is interested in estimating the total volume of trees in a timber sale. He records the volume for each tree in a simple random sample. In addition he measures the basal area for each tree marked for sale. He then uses a ratio estimator of total volume. The forester decides to take a simple random sample of n = 12 from the N = 250 trees marked for sale. Let x denote basal area and y the cubic foot volume for a tree. The total basal area for all 250 trees, Tx, is 75 square feet. Use the data below to estimate Ty, the total cubic foot volume for those trees marked for sale, and place a bound on the error of estimation.
Tree sampled   Basal area (x)   Cubic foot volume (y)     x²      y²     (xy)²
     1              .3                  6                0.09      36      3.24
     2              .5                  9                0.25      81     20.25
     3              .4                  7                0.16      49      7.84
     4              .9                 19                0.81     361    292.41
     5              .7                 15                0.49     225    110.25
     6              .2                  5                0.04      25      1
     7              .6                 12                0.36     144     51.84
     8              .5                  9                0.25      81     20.25
     9              .8                 20                0.64     400    256
    10              .4                  9                0.16      81     12.96
    11              .8                 18                0.64     324    207.36
    12              .6                 13                0.36     169     60.84
     Σ             6.7                142                4.25   1,976  1,044.24

Answer: Total = 1589.5522; Error = 186.3176

2. Use the y-data in Exercise 1 to compute an estimate of Ty using $N\bar{y}$. Place a bound on the error of estimation. Compare your results to those obtained in Exercise 1.

Answer: Total = 2958.3333; Error = 730.13697

3. A consumer survey was conducted to determine the ratio of the money spent on food to the total income per year for households in a small community. A simple random sample of 14 households was selected from 150 in the community. Sample data are tabulated below. Estimate R, the population ratio, and place a bound on the error of estimation.

Household   Total income (xi)   Amount spent on food (yi)        xi²            yi²          xi yi
    1             5,010                   990                25,100,100        980,100     4,959,900
    2            12,240                 2,524               149,817,600      6,370,576    30,893,760
    3             9,600                 1,935                92,160,000      3,744,225    18,576,000
    4            15,600                 3,123               243,360,000      9,753,129    48,718,800
    5            14,400                 2,760               207,360,000      7,617,600    39,744,000
    6             6,500                 1,337                42,250,000      1,787,569     8,690,500
    7             8,700                 1,756                75,690,000      3,083,536    15,277,200
    8             8,200                 2,132                67,240,000      4,545,424    17,482,400
    9            14,600                 3,504               213,160,000     12,278,016    51,158,400
   10            12,700                 2,286               161,290,000      5,225,796    29,032,200
   11            11,500                 2,875               132,250,000      8,265,625    33,062,500
   12            10,600                 2,226               112,360,000      4,955,076    23,595,600
   13             7,700                 1,463                59,290,000      2,140,369    11,265,100
   14             8,500                 1,905                72,250,000      3,629,025    16,192,500
    Σ           145,850                30,816             1,653,577,700     74,376,066   348,648,860

Answer: Ratio = .2113; Error = .0126

4. A corporation is interested in estimating the total earnings from sales of color television sets at the end of a given three month period.
The total earnings figures are available for all districts within the corporation for the corresponding three month period of the previous year. A simple random sample of 13 district offices is selected from the 123 offices within the corporation. Using a ratio estimator, estimate Ty and place a bound on the error of estimation. Use the data in the table below and take Tx = 128,200.

Office   Three month data from    Three month data from       xi²          yi²         xi yi
         previous year (xi)       current year (yi)
   1            550                      610                302,500      372,100      335,500
   2            720                      780                518,400      608,400      561,600
   3          1,500                    1,600              2,250,000    2,560,000    2,400,000
   4          1,020                    1,030              1,040,400    1,060,900    1,050,600
   5            620                      600                384,400      360,000      372,000
   6            980                    1,050                960,400    1,102,500    1,029,000
   7            928                      977                861,184      954,529      906,656
   8          1,200                    1,440              1,440,000    2,073,600    1,728,000
   9          1,350                    1,570              1,822,500    2,464,900    2,119,500
  10          1,750                    2,210              3,062,500    4,884,100    3,867,500
  11            670                      980                448,900      960,400      656,600
  12            729                      865                531,441      748,225      630,585
  13          1,530                    1,710              2,340,900    2,924,100    2,616,300
   Σ         13,547                   15,422             15,963,525   21,073,754   18,273,841

Answer: Total = 145,943.7809; Error = 7,353.67

5. Use the data in Exercise 4 to estimate the mean earnings for offices within the corporation. Place a bound on the error of estimation.

Answer: Mean = 1,186.5348; Error = 59.79

6. An investigator has a colony of N = 763 rats which have been subjected to a standard drug. The average length of time to thread a maze correctly under influence of the standard drug was found to be μx = 17.2 seconds. The investigator now would like to subject a random sample of 11 rats to a new drug. Estimate the average time required to thread the maze while under the influence of the new drug. Place a bound on the error of estimation. (Hint: it is reasonable to employ a ratio estimator for μy if we assume that the rats will react to the new drug in much the same way as they did to the standard drug.)
Rat   Standard drug (xi)   New drug (yi)      xi²        yi²       xi yi
  1         14.3               15.2          204.49     231.04     217.36
  2         15.7               16.1          246.49     259.21     252.77
  3         17.8               18.1          316.84     327.61     322.18
  4         17.5               17.6          306.25     309.76     308
  5         13.2               14.5          174.24     210.25     191.4
  6         18.8               19.4          353.44     376.36     364.72
  7         17.6               17.5          309.76     306.25     308
  8         14.3               14.1          204.49     198.81     201.63
  9         14.9               15.2          222.01     231.04     226.48
 10         17.9               18.1          320.41     327.61     323.99
 11         19.2               19.5          368.64     380.25     374.4
  Σ        181.2              185.3        3,027.06   3,158.19   3,090.93

Answer: Mean = 17.5892; Error = .2710

7. A group of 100 rabbits is being used in a nutrition study. A pre-study weight is recorded for each rabbit. The average of these weights is 3.1 pounds. After two months the experimenter wants to obtain a rough approximation of the average weight of the rabbits. He selects n = 10 rabbits at random and weighs them. The original weights and current weights are presented below:

Rabbit   Original weight   Current weight
   1           3.2               4.1
   2           3.0               4.0
   3           2.9               4.1
   4           2.8               3.9
   5           2.8               3.7
   6           3.1               4.1
   7           3.0               4.2
   8           3.2               4.1
   9           2.9               3.9
  10           2.8               3.8

Estimate the average current weight and place a bound on the error of estimation.

Answer: Mean = 4.1646; Error = .0847

8. A social worker wants to estimate the ratio of the average number of rooms per apartment to the average number of people per apartment in an urban ghetto area. He selects a simple random sample of 25 apartments from the 275 in the ghetto area. Let xi denote the number of people in apartment i, and let yi denote the number of rooms in apartment i. From a count of the number of rooms and number of people in each apartment, the following data are obtained:

Estimate the ratio of average number of rooms to average number of people for this area, and place a bound on the error of estimation.

Answer: Ratio = .283; Error = .0616

9. A forest resource manager is interested in estimating the number of dead fir trees in a 300 acre area of heavy infestation. Using an aerial photo, he divides the area into 200 one-and-a-half-acre plots.
Let x denote the photo count of dead firs and y the actual ground count for a simple random sample of n = 10 plots. The total number of dead fir trees obtained from the photo count is Tx = 4,200. Use the sample data below to estimate Ty, the total number of dead firs in the 300 acre area. Place a bound on the error of estimation.

Plot sampled   Photo count (xi)   Ground count (yi)     xi²      yi²     xi yi
     1               12                 18              144      324      216
     2               30                 42              900    1,764    1,260
     3               24                 24              576      576      576
     4               24                 36              576    1,296      864
     5               18                 24              324      576      432
     6               30                 36              900    1,296    1,080
     7               12                 14              144      196      168
     8                6                 10               36      100       60
     9               36                 48            1,296    2,304    1,728
    10               42                 54            1,764    2,916    2,268
     Σ                                                6,660   11,348    8,652

Answer: Total = 5,492.3077; Error = 428.4381

10. Members of a teachers' association are concerned about the salary increases given to high school teachers in a particular school system. A simple random sample of n = 15 teachers is selected from an alphabetical listing of all high school teachers in the system. All 15 teachers are interviewed to determine their salaries for this year and the previous year. Use these data to estimate R, the rate of change, for N = 750 high school teachers in the community school system. Place a bound on the error of estimation.
Teacher   Past year's salary (xi)   Present year's salary (yi)       xi²            yi²           xi yi
    1            5,400                     5,600                 29,160,000     31,360,000     30,240,000
    2            6,700                     6,940                 44,890,000     48,163,600     46,498,000
    3            7,792                     8,084                 60,715,264     65,351,056     62,990,528
    4            9,956                    10,275                 99,121,936    105,575,625    102,297,900
    5            6,355                     6,596                 40,386,025     43,507,216     41,917,580
    6            5,108                     5,322                 26,091,664     28,323,684     27,184,776
    7            7,891                     8,167                 62,267,881     66,699,889     64,445,797
    8            5,216                     5,425                 27,206,656     29,430,625     28,296,800
    9            5,416                     5,622                 29,333,056     31,606,884     30,448,752
   10            5,397                     5,597                 29,127,609     31,326,409     30,207,009
   11            8,152                     8,437                 66,455,104     71,182,969     68,778,424
   12            6,436                     6,700                 41,422,096     44,890,000     43,121,200
   13            9,192                     9,523                 84,492,864     90,687,529     87,535,416
   14            7,006                     7,279                 49,084,036     52,983,841     50,996,674
   15            7,311                     7,582                 53,450,721     57,486,724     55,432,002
    Σ          103,328                   107,149                743,204,912    798,576,051    770,390,858

Answer: Ratio = 1.037; Error = 0.001391

11. An experimenter was investigating a new food additive for cattle. Midway through the two month study, he was interested in estimating the average weight for the entire herd of N = 500 steers. A simple random sample of n = 12 steers was selected from the herd and weighed. These data and pre-study weights are presented below for all cattle sampled. Assume μx, the pre-study average, was 880 lbs. Estimate μy, the average weight for the herd, and place a bound on the error of estimation. All the weights below are in pounds.

Steer   Pre-study weight (xi)   Present weight (yi)       xi²          yi²         xi yi
  1            815                    897               664,225      804,609      731,055
  2            919                    992               844,561      984,064      911,648
  3            690                    752               476,100      565,504      518,880
  4            984                  1,093               968,256    1,194,649    1,075,512
  5            200                    768                40,000      589,824      153,600
  6            260                    828                67,600      685,584      215,280
  7          1,323                  1,428             1,750,329    2,039,184    1,889,244
  8          1,067                  1,152             1,138,489    1,327,104    1,229,184
  9            789                    875               622,521      765,625      690,375
 10            573                    642               328,329      412,164      367,866
 11            834                    909               695,556      826,281      758,106
 12          1,049                  1,122             1,100,401    1,258,884    1,176,978
  Σ          9,503                 11,458             8,696,367   11,453,476    9,717,728

Answer: Mean = 1,061.0376; Error = 139.9468

12. An advertising firm is concerned about the effect of a new regional promotional campaign on the total dollar sales for a particular product. A simple random sample of n = 20 stores is drawn from the N = 452 regional stores in which the product is sold. Quarterly sales data are obtained for the current three-month period and the three-month period prior to the new campaign. Use these data to estimate Ty, the total sales for the current period, and place a bound on the error of estimation. Assume Tx = 216,256.

Store   Pre-campaign sales (xi)   Present sales (yi)      xi²         yi²        xi yi
  1            208                     239               43,264      57,121      49,712
  2            400                     428              160,000     183,184     171,200
  3            440                     472              193,600     222,784     207,680
  4            259                     276               67,081      76,176      71,484
  5            351                     363              123,201     131,769     127,413
  6            880                     942              774,400     887,364     828,960
  7            273                     294               74,529      86,436      80,262
  8            487                     514              237,169     264,196     250,318
  9            183                     195               33,489      38,025      35,685
 10            863                     897              744,769     804,609     774,111
 11            599                     626              358,801     391,876     374,974
 12            510                     538              260,100     289,444     274,380
 13            828                     888              685,584     788,544     735,264
 14            473                     510              223,729     260,100     241,230
 15            924                     998              853,776     996,004     922,152
 16            110                     171               12,100      29,241      18,810
 17            829                     889              687,241     790,321     736,981
 18            257                     265               66,049      70,225      68,105
 19            388                     419              150,544     175,561     162,572
 20            244                     257               59,536      66,049      62,708
  Σ          9,506                  10,181            5,808,962   6,609,029   6,194,001

Answer: Total = 231,611.86; Error = 3,073.83

13. Use the data of Exercise 12 to determine the sample size required to estimate Ty with a bound on the error of estimation equal to $3,800.

Answer: n = 14.

14. A 10-percent simple random sample of housing units in a village has been selected, producing the 12 housing units listed below. At each sample unit, information was obtained on the number of persons in the household and the total annual earnings; the results are given below. It is also known from independent sources that the total population of all households in the village is 600 persons.

Sample unit   Total persons   Total earnings
     1              6            $ 7,000
     2              6              8,000
     3              5              3,000
     4              8             10,000
     5              4              2,000
     6              2              1,000
     7              4              2,000
     8              5              3,000
     9              1              1,000
    10              7              8,000
    11              4              1,000
    12              5              6,000
  Total            57            $52,000

a. Estimate the total earnings in all households in the village using a direct inflation factor.
b. Estimate the total earnings in all households in the village using a ratio estimate.
c. Use the sample results to estimate the coefficient of variation for each of the above estimates.

15. The following table shows the total hectares in three farms, along with the payments for farm labor, drawn from 30 farms. The true value of the total hectares of all farms, X, is assumed to be 800.

Farm (i)   Hectares (xi)   Payments (yi)
   1             5              382
   2             8              467
   3            10              701

a. Estimate the total payments for all farms, Y.
b. Estimate the variance of the estimate in (a).
c. Compute the coefficient of variation of the estimate in (a).
d. Find a 95% confidence interval for Y.

CHAPTER 10 CLUSTER SAMPLING
________________________________________________________________________________________

10.1 DESCRIPTION OF CLUSTER SAMPLING

The discussion so far has been about sampling methods in which the units of analysis (people, farms, business firms, etc.) were considered as arranged in a list (or its equivalent) and a sample of individual units could be selected directly from the list. Now we will consider a sampling procedure in which the units of analysis in the population are grouped into clusters and a sample of clusters (rather than a sample of individual units of analysis) is selected. The sample clusters then determine the units to be included. The determination may be made in either of two ways:

(1) The sample could include all units in the selected clusters. This is usually referred to as single-stage cluster sampling.
(2) A subsample of units in the selected clusters could be selected for enumeration. This is called multi-stage cluster sampling, or simply multi-stage sampling.

There are two main reasons for using cluster sampling.
Often there is no adequate frame (such as a list) from which to select a sample of the elements in the population, and the cost of constructing such a frame may be too great. In other cases, such a frame may exist but the savings in field costs obtained by cluster sampling (on some kind of geographical basis) may make this method more efficient than a simple random sample from a list. In most practical situations, a sample of a given number of units selected at random will have smaller variance than a sample of the same size selected in clusters; nevertheless, when cost is balanced against precision, the cluster sample may be more efficient. Even though the units in which we are interested are not selected directly, the probability of selecting a cluster and each unit in it (i.e., the probability of selecting a unit from the population) is fixed in advance; consequently, cluster sampling satisfies the criterion for probability sampling. Let us consider some examples to see how cluster sampling works. 10.1.1 Single-Stage Cluster Sampling To draw a sample of persons, it would generally not be feasible to obtain a list of all persons, and then to select a sample from the list. It might be possible to find a list of families. We could then select a sample of families and obtain information by interview concerning all persons in the selected families. This is an example of single-stage cluster sampling; the family constitutes the cluster. Note that for a given number of individuals in the sample, it would undoubtedly be less costly in terms of both travel and time to take all persons within selected families than to select the same number of persons at random from all individuals in the population. Often there is no list of families available, and some other procedure must be used. A possible method is as follows. In large cities, a map showing the boundaries of city blocks can usually be obtained; and we can select a sample of blocks. 
In the rest of the country, we can use maps divided into small areas called segments, which have identifiable boundaries, and select a sample of segments. Within the sample blocks and segments, we could include all persons in the sample; alternatively, we could select a sample of persons living in the selected blocks. The choice would depend upon the number of stages of sampling we believe would be most efficient. By using maps, we eliminate the need for a list of all persons. We replace it with a list of blocks and segments and a list of families within a sample of blocks and segments. (In practice there frequently is an earlier stage of sampling in which a sample of cities and/or other administrative areas is selected.) The preceding discussion illustrates an important application of cluster sampling; namely, area sampling. However, other applications of cluster sampling are frequently made. 10.1.2 Multi-Stage Cluster Sampling Suppose we wish to make a survey of school children in order to obtain information on their health, or information on their knowledge of a particular subject. One way to do this is to obtain a complete list of schools, then select a sample of schools, and finally choose a sample of children within the selected schools. Similarly, a sample of factory workers could be selected by first choosing a sample of factories and then interviewing a sample of workers within these factories. In both cases we would need to construct a list of individuals only for the schools or factories selected in the sample. These examples illustrate multi-stage (specifically, two-stage) cluster sampling. The probability that any unit in the population is selected in the sample can be expressed as the product of the probabilities at each stage. 
Thus, in the first example the probability of selecting the jth child from the ith school is the probability of first selecting the ith school times the conditional probability of selecting the jth child, given that the ith school has been selected. That is,

P(jth child, ith school) = P(ith school) x P(jth child | ith school).

10.2 AREA SAMPLING

Since area sampling is a frequently used application of cluster sampling, we shall describe in more detail the methods which are usually applied. Area sampling is useful when one or both of the following conditions exist:

(1) When complete lists of housing units (or other desired units of observation) are not available but maps having a reasonable amount of detail are available. Such maps can be considered as a list covering all of the housing units in the area.
(2) When there are large travel costs in sending an interviewer from one randomly selected sample housing unit to another randomly selected housing unit. For a given amount of money, we may be able to increase the number of sample housing units greatly by grouping units together and selecting a random sample of groups.

Three simple procedures exist for drawing an area sample. We shall use city blocks as an illustration (segments of land with identifiable boundaries around them could be used in rural areas in exactly the same way as blocks are used in cities). We shall assume that a 1-percent sample of housing units is to be drawn.

Procedure A for a sample of areas to be enumerated completely:

(1) Obtain a reasonably accurate map of the city, showing as much detail as possible for blocks. If the map is not new, one should take steps through local inquiry to bring it up-to-date (for example, draw in new streets that have been opened since the map was printed).
(2) Number the blocks serially, entering the numbers directly on the map; a serpentine numbering system is advisable in order to make certain that no blocks are omitted.
(3) Select a simple random or systematic sample of blocks, using a 1-percent sample. If a systematic sample is used, select a random number from 1 to 100 to determine the first sample block, and include every one-hundredth block thereafter. (4) Interview all households in the sample blocks. Procedure B for a sample of areas with subsampling of smaller areas: The 1-percent sample can also be obtained by drawing, for example, a sample of 1 in 25 blocks, then taking a subsample of one-fourth of the area in each sample block. (1) Proceed as in (1), (2), and (3) in procedure A above, except that instead of taking 1 in 100 blocks, take 1 block in 25. (2) Divide each of the sample blocks into 4 segments. If maps are available that show the internal structure of each block (alleys, buildings, etc.), these can be used. If not, make a quick and crude sketch of the sample blocks, showing each building; use this sketch as the basis of the segmentation. The 4 segments within any block should have roughly the same number of housing units in each. (3) Number the segments in each block from 1 to 4. (4) Select the sample segments by taking a random number from 1 to 4 for each block. (5) Interview all households in the selected segments. Notice that although a 1-percent sample is obtained in both procedures, procedure B includes more sample blocks and fewer housing units per block. Usually, it will cost more to obtain the same sample size by procedure B, since there is a cost of subsampling not involved in procedure A; also, travel will be increased in visiting a greater number of blocks. This subsampling procedure is almost equivalent to dividing every block in the city into 4 parts, or segments, and taking 1 in 100 of these segments. 
Hence, the use of subsampling as described above in procedure B can be regarded as essentially equivalent to using a sample of small clusters of housing units (in which every housing unit would be enumerated) but with two-stage sampling as a device for reducing the work of drawing a sample of small clusters.

Procedure C for a sample of areas with listing and subsampling:

To carry out procedure B, it is necessary to have or to construct detailed maps. A third procedure accomplishes approximately the same results and is frequently applicable when detailed maps are not available and are not easy to prepare.

(1) Proceed as in step (1) of procedure B, again selecting a sample of 1 in 25 blocks.

(2) Visit each sample block and make a list of all the housing units in it. Number the housing units serially. The numbering can be done (a) separately by blocks (that is, starting with 1 for each block), (b) in a single sequence throughout all the sample blocks, or (c) by some combination, such as a separate sequence for various groups of blocks.

(3) Select one-fourth of the housing units within the sample blocks either by using a random number table, or by systematic sampling using the serial numbers assigned to the housing units.

(4) Interview the households whose serial numbers are selected for the sample.

Note: If advance information is available on the approximate numbers of housing units in all blocks, some combination of the above procedures with stratification of blocks by size can be used.

10.3 CHOICE OF SAMPLING UNIT AND SAMPLE DESIGN

In designing a sample, the sampling statistician must decide how many sampling stages are to be used. In addition, at each stage he must determine the sampling unit. In making his decision, the statistician often has many alternatives from which to choose. Suppose, for example, that he desires to estimate the average number of cattle per holding.
Ultimately, the information must be obtained from a sample of individual holdings (units of analysis or elementary units). In order to obtain such a sample, however, any of the following plans could be used:

(1) A simple random, systematic, or stratified sample of individual holdings could be taken if complete and accurate lists of holdings were available.

(2) Maps could be used to subdivide the country into small area segments (for example, segments containing an average of 5 or 10 holdings). A sample of these area segments could then be selected, and all holdings within each selected segment included in the sample. For holdings which extend across segment boundaries, rules would be needed to associate holdings with segments.

(3) A sample of small administrative subdivisions, such as districts, could be selected. All holdings in the selected districts could be included in the sample, or a subsample of holdings could be selected.

(4) A sample of provinces (larger administrative divisions) could be selected, and a sample of areas and holdings within the selected provinces could be taken in one of the ways described in procedures A, B, and C above.

Where subsampling is used, the cluster initially selected is called the first-stage unit or the primary sampling unit (PSU) and the unit of subsampling is called the second-stage unit (SSU). For example, in (3) above, if a subsample of holdings is selected, the "district" is the PSU and the holding is the second-stage unit; in (4), the "province" is the PSU, the small area is the second-stage unit, and still smaller areas or holdings may be third-stage units (TSU).

How can one make an intelligent choice among the various alternatives? We may reason as follows: where cost is not important, single-stage sampling using the elementary unit (the holding in the above case) as the sampling unit provides the most accurate results for the given number of elementary units in the sample.
(There are some exceptions, but these are rather unusual cases.) On the other hand, when cost and administrative convenience are important, a cluster sample involving one or more stages may be desirable. The cost of enumeration per elementary unit is usually much less if the units are in clusters than if they are randomly distributed throughout the country; by clustering, travel time and cost for interviewing are reduced. As a result, for a given amount of money it may be possible, by using cluster sampling, to increase the number of elementary units in the sample above the number that the same budget would allow if these were selected at random. If the increase in the number of units more than compensates for the fact that a cluster sample tends to increase the standard error, a net gain will be obtained in the reliability of estimates made from the sample. In order to choose among alternative sampling units, we must therefore balance the expected costs against the standard errors for the various possible designs and use the method which will provide the smallest standard error for a fixed cost. In some administrative situations, the correct decision may be obvious. If the survey involves little or no travel cost--for example, if mail questionnaires are used, or if the survey uses personnel who travel around as a normal part of their other activities, such as policemen or postmen (mailmen)--and if listings of elementary units are available, the elementary unit should always be taken as the sampling unit. If travel costs or the costs of constructing lists of elementary units are rather large, an alternative design using a clustered sample will usually be better. A full discussion of this matter is beyond the scope of these chapters, but some of the important points will be discussed here. 
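The trade-off described in this section can be made concrete with a small numerical sketch. It is only an illustration: the budget, the unit costs, and the factor by which clustering inflates the variance are hypothetical planning figures, not values from this manual, and the finite population correction is ignored.

```python
def variance_for_budget(budget, cost_per_unit, variance_inflation=1.0, unit_variance=1.0):
    """Approximate variance of a sample mean under a fixed variable budget.

    All inputs are hypothetical planning figures:
      cost_per_unit      -- average cost of enumerating one elementary unit
      variance_inflation -- assumed factor by which clustering raises the
                            variance relative to an unclustered sample of
                            the same size
      unit_variance      -- variance between elementary units
    """
    units_affordable = budget / cost_per_unit
    return variance_inflation * unit_variance / units_affordable


# Same budget, two designs (all figures assumed for illustration).
srs = variance_for_budget(2500, cost_per_unit=10.0)
cluster = variance_for_budget(2500, cost_per_unit=4.0, variance_inflation=2.0)

print(f"unclustered design: {srs:.4f}")
print(f"cluster design:     {cluster:.4f}")
```

Under these assumed figures the cluster design affords 625 units instead of 250, which more than offsets a doubling of the variance per unit. With an assumed inflation factor of 3 the unclustered design would be the better choice, which is why both costs and standard errors have to be estimated before choosing.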
10.4 ANALYSIS OF COSTS

Usually there is a fixed budget available for a survey, and one of the major functions of the sampling statistician is to provide a method of obtaining the smallest sampling error for this budget. Let us first examine how costs enter into a survey involving the use of cluster sampling. In studying stratified sampling, we discussed the possibility that enumeration and processing costs can vary from stratum to stratum, and we constructed a cost function which expressed the variable part of the total cost as a sum of unit costs multiplied by sample sizes (for example, C = C1 n1 + C2 n2 + ...). A similar approach is needed for cluster sampling, although the unit costs are of a different type. For simplicity, let us consider a two-stage sample.

10.4.1 Components of Cost

In order to analyze the costs of a two-stage cluster sample, it is necessary to identify the various phases of the survey and to distinguish between three elements of cost:

(1) Overhead costs; that is, those costs that are fixed regardless of the manner in which the sample is selected.

(2) Costs that depend primarily on the number of first-stage clusters in the sample, and the way in which such costs vary as the number of these primary sampling units in the sample varies.

(3) The costs that depend primarily on the number of second-stage units in the sample, and the way in which such costs vary with this number.

10.4.1.1 Overhead Costs

Overhead costs include such things as the administrative and technical work required for the survey, rent for space and for some types of equipment, cost of printing the final results, etc. These costs will generally be approximately the same, even with great variations in the size and design of the survey. Since these costs are not affected by the size of the survey, they do not enter into the decision on sample design.
The only reason for separating these costs is to subtract them from the total available budget in order to see what funds can be spent on the variable costs.

10.4.1.2 Costs of First-Stage Units

Certain costs will usually vary in proportion to the number of first-stage sampling units. These will include (a) the cost of selecting, traveling to, and locating each first-stage unit, (b) the cost of preparing a list of second-stage units (within the primary unit), and (c) the cost of designating the subsample of second-stage units. There may also be other costs (costs of preparing maps for the first-stage sample units, hiring special enumerators to handle each one, etc.) depending on the nature of the administrative organization, and the materials available before the start of the survey.

10.4.1.3 Costs of Second-Stage Units

The costs which depend on the number of second-stage units will include the costs of interviewing, reviewing the survey results, coding, recording, etc.

10.4.2 A Simple Cost Function

Let us assume a simple situation in which the cost per first-stage unit does not change despite changes in the number of such units in the sample. Similarly, the cost per second-stage unit does not change. Then the total variable cost (which excludes overhead costs) can be represented by

C = C1 n + C2 m        (10.1)

where C1 is the cost per first-stage unit, C2 is the cost per second-stage unit, n is the total number of first-stage sampling units, m is the total number of second-stage sampling units, and m/n is the average number of second-stage units in a primary unit.

Using equation (10.1), one can set down combinations of n and m which would add up to the same cost. For example, suppose the total variable cost available for a survey was $2,500, and the estimates of C1 and C2 were $10 and $2, respectively.
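The allocations that exhaust this budget can be tabulated with a few lines of code. This is a minimal sketch assuming the simple cost function C = C1 n + C2 m; the function name `allocations` is introduced here for illustration only.

```python
def allocations(budget, c1, c2, n_values):
    """For each number of first-stage units n, return (n, m, m / n), where m is
    the number of second-stage units that uses up the rest of the budget
    under the cost model budget = c1 * n + c2 * m."""
    rows = []
    for n in n_values:
        m = (budget - c1 * n) / c2
        rows.append((n, m, m / n))
    return rows


# Budget of $2,500 with C1 = $10 and C2 = $2, as in the example in the text.
for n, m, average in allocations(2500, 10, 2, [10, 20, 50, 75, 100, 125, 150]):
    print(f"{n:>5} {m:>7.0f} {average:>7.1f}")
```

The same loop works unchanged for other budgets or unit costs, which is convenient when the cost estimates themselves are uncertain.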
The table below shows various combinations of sample sizes all of which would cost exactly $2,500; the last column shows the average size of cluster for each allocation:

    Number of units in sample
    First stage (n)    Second stage (m)    Average cluster size (m/n)
          10                1200                  120
          20                1150                   57.5
          50                1000                   20
          75                 875                   11.7
         100                 750                    7.5
         125                 625                    5
         150                 500                    3.3

If the sampling error can be found for each of the above combinations, one can choose that combination which would give the lowest sampling error. In fact, with this simple type of cost function, it is usually possible to determine the optimum allocation mathematically. However, this is not necessary; if a formula can be found which expresses the variance in terms of n and m, we can easily see which combination is best. Furthermore, this can also be done in situations involving more complex cost functions, when it is more difficult to develop a mathematical solution to the problem of optimum allocation. The next chapter will be devoted to analyzing the variances for the simpler and more common situations.

10.4.3 More Complex Cost Functions

One additional comment on costs should be made. The formulation of the cost function above as C = C1 n + C2 m covers the simplest type of situation only. In practice, the cost function may be much more complex. For example, there may be stratification for either the first-stage or the second-stage units with different unit costs in each stratum. The cost function would then be a sum over the strata h of the form

$$C = \sum_h \left( C_{1h}\, n_h + C_{2h}\, m_h \right)$$

and the problem of the allocation of the sample would be a combination of optimum allocation for cluster sampling with optimum allocation for stratified sampling.

Frequently, the unit costs would depend on the number of units in the sample. For example, suppose that C1 included a part that resulted from the time spent traveling from one first-stage unit to another. With only a few primary units in the sample, the average distance from one to the next might be quite large, resulting in a high value of C1.
However, as the number of units in the sample increases, the average distance gets smaller and C1 will be smaller. A different type of cost function would be used in such a situation. In general, in planning a large-scale and important survey, a detailed analysis should be made of how costs vary, in order to construct a cost function which is realistic for that particular survey.

10.5 ESTIMATION OF A POPULATION MEAN AND TOTAL

Cluster sampling is simple random sampling with each sampling unit containing a number of elements. Hence, the estimators of the population mean, $\mu$, and total, T, are similar to those for simple random sampling. In particular, the sample mean, $\bar{y}$, is a good estimator of the population mean, $\mu$. An estimator of $\mu$ and two estimators of T are discussed in this section.

The following notation is used in this chapter:

N = the number of clusters in the population
n = the number of clusters selected in a simple random sample
$m_i$ = the number of elements in cluster i, i = 1, ..., N
$\bar{m} = \frac{1}{n}\sum_{i=1}^{n} m_i$ = the average cluster size for the sample
$M = \sum_{i=1}^{N} m_i$ = the number of elements in the population
$\bar{M} = M/N$ = the average cluster size for the population
$y_i$ = the total of all observations in the ith cluster

The estimator of the population mean, $\mu$, is the sample mean, $\bar{y}$, which takes the form of a ratio estimator, as developed in Chapter 11, with $m_i$ taking the place of $x_i$. The estimated variance of $\bar{y}$ therefore has the form of the variance of a ratio estimator.

Estimator of the population mean $\mu$:

$$\bar{y} = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} m_i} \tag{10.1}$$

Estimated variance of $\bar{y}$:

$$\hat{V}(\bar{y}) = \left(\frac{N-n}{N n \bar{M}^2}\right) \frac{\sum_{i=1}^{n} (y_i - \bar{y}\, m_i)^2}{n-1} \tag{10.2}$$

The bound on the error of estimation is therefore $2\sqrt{\hat{V}(\bar{y})}$. $\bar{M}$ can be estimated by $\bar{m}$ if M is unknown. The estimated variance in equation (10.2) is biased and a good estimator of $V(\bar{y})$ only if n is large, say $n \ge 20$. The bias disappears if the cluster sizes $m_1, m_2, \ldots, m_N$ are all equal.

Estimator of the population total T:

$$M\bar{y} = M\, \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} m_i} \tag{10.4}$$

Estimated variance of $M\bar{y}$:

$$\hat{V}(M\bar{y}) = M^2\, \hat{V}(\bar{y}) = N^2 \left(\frac{N-n}{N n}\right) \frac{\sum_{i=1}^{n} (y_i - \bar{y}\, m_i)^2}{n-1} \tag{10.5}$$

Note that the estimator $M\bar{y}$ is useful only if the number of elements in the population, M, is known.
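The ratio-type estimator of the mean, the estimator of the total that uses M, and their bounds can be sketched in code as follows. The sketch assumes the standard forms of equations (10.1), (10.2), (10.4) and (10.5) for one-stage cluster sampling and takes the bound as two estimated standard errors; the function names are introduced here for illustration, and the data are those of Problem 1 at the end of this chapter.

```python
from math import sqrt


def cluster_mean(y, m, N, M=None):
    """Ratio estimate of the mean per element and its bound (two standard errors).

    y -- cluster totals for the n sampled clusters
    m -- cluster sizes for the same clusters
    N -- number of clusters in the population
    M -- number of elements in the population, if known
    """
    n = len(y)
    ybar = sum(y) / sum(m)
    # Average cluster size: population value if M is known, else sample average.
    M_bar = M / N if M is not None else sum(m) / n
    s2 = sum((yi - ybar * mi) ** 2 for yi, mi in zip(y, m)) / (n - 1)
    variance = (N - n) / (N * n * M_bar ** 2) * s2
    return ybar, 2 * sqrt(variance)


def cluster_total(y, m, N, M):
    """Estimate of the population total, M * ybar, and its bound; requires M."""
    ybar, bound = cluster_mean(y, m, N, M)
    return M * ybar, M * bound


# Problem 1 data: number of saws and total repair cost for 20 of 96 industries.
saws = [3, 7, 11, 9, 2, 12, 14, 3, 5, 9, 8, 6, 3, 2, 1, 4, 12, 6, 5, 8]
cost = [50, 110, 230, 140, 60, 280, 240, 45, 60, 230,
        140, 130, 70, 50, 10, 60, 280, 150, 110, 120]

mean, bound = cluster_mean(cost, saws, N=96)
total, total_bound = cluster_total(cost, saws, N=96, M=710)
print(f"mean per saw: {mean:.2f}, bound: {bound:.2f}")
print(f"total: {total:.2f}, bound: {total_bound:.2f}")
```

With these data the sketch gives a mean of about 19.73 with a bound of about 1.78, and, taking M = 710 as in Problem 3, a total of about 14,008.85 with a bound of about 1,110.8, in agreement with the answers listed for those problems.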
Often the number of elements in the population is not known in problems for which cluster sampling is appropriate. This makes it impossible to use the estimator $M\bar{y}$, but we can form another estimator of the population total which does not depend on M. The quantity $\bar{y}_t$, given by

$$\bar{y}_t = \frac{1}{n}\sum_{i=1}^{n} y_i \tag{10.7}$$

is the average of the cluster totals for the n sampled clusters. Hence, $\bar{y}_t$ is an unbiased estimator of the average of the N cluster totals in the population. By the same reasoning as used previously, the estimator $N\bar{y}_t$ is an unbiased estimator of the sum of the cluster totals or, equivalently, of the population total, T. For example, it is highly unlikely that the number of adult males in a city would be known, and hence the estimator $N\bar{y}_t$ rather than $M\bar{y}$ would have to be used to estimate T.

An estimator of the population total, T, which does not depend on M:

$$N\bar{y}_t = \frac{N}{n}\sum_{i=1}^{n} y_i \tag{10.8}$$

The estimated variance of $N\bar{y}_t$:

$$\hat{V}(N\bar{y}_t) = N^2 \left(\frac{N-n}{N n}\right) \frac{\sum_{i=1}^{n} (y_i - \bar{y}_t)^2}{n-1} \tag{10.9}$$

If there is a large amount of variation among the cluster sizes and if cluster sizes are highly correlated with cluster totals, the variance of $N\bar{y}_t$ is generally larger than the variance of $M\bar{y}$. The estimator $N\bar{y}_t$ does not use the information provided by the cluster sizes and, hence, may be less precise.

The estimators of $\mu$ and T possess special properties when all cluster sizes are equal, that is, when $m_1 = m_2 = \cdots = m_N$. First, the estimator $\bar{y}$, given by equation (10.1), is an unbiased estimator of the population mean $\mu$. Second, $\hat{V}(\bar{y})$, given by equation (10.2), is an unbiased estimator of the variance of $\bar{y}$. Finally, the two estimators, $M\bar{y}$ and $N\bar{y}_t$, of the population total T are equivalent.

10.6 Selecting the Sample Size for Estimating Population Means and Totals

The quantity of information in a cluster sample is affected by two factors, the number of clusters and the relative cluster size. We have not encountered the latter factor in any of the sampling procedures discussed previously.
In the problem of estimating the number of homes with inadequate fire insurance in a state, the clusters could be counties, voting districts, school districts, communities, or any other convenient grouping of homes. We will assume that the relative cluster size has been selected in advance and will consider the problem of choosing the number of clusters, n.

From equation (10.2), the estimated variance of $\bar{y}$ is

$$\hat{V}(\bar{y}) = \left(\frac{N-n}{N n \bar{M}^2}\right) s_c^2$$

where

$$s_c^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y}\, m_i)^2}{n-1} \tag{10.11}$$

The actual variance of $\bar{y}$ is approximately

$$V(\bar{y}) \approx \left(\frac{N-n}{N n \bar{M}^2}\right) \sigma_c^2 \tag{10.12}$$

where $\sigma_c^2$ is the population quantity estimated by $s_c^2$.

Because we do not know $\sigma_c^2$ or the average cluster size, $\bar{M}$, choice of the sample size, that is, the number of clusters necessary to purchase a specified quantity of information concerning a population parameter, is difficult. We overcome this difficulty by using an estimate of $\sigma_c^2$ and $\bar{M}$ from a prior survey or by selecting a preliminary sample containing n' units. Thus, as in all problems of selecting a sample size, we equate two standard deviations of our estimator to a bound on the error of estimation, E. This bound is chosen by the experimenter and represents the maximum error that he is willing to tolerate. That is,

$$2\sqrt{V(\bar{y})} = E$$

Using equation (10.12), we can solve for n. We obtain similar results when using $M\bar{y}$ to estimate the population total T because $V(M\bar{y}) = M^2 V(\bar{y})$.

The approximate sample size required to estimate $\mu$ with a bound, E, on the error of estimation:

$$n = \frac{N \sigma_c^2}{N D + \sigma_c^2} \tag{10.13}$$

where $\sigma_c^2$ is estimated by $s_c^2$ and $D = E^2 \bar{M}^2 / 4$.

The approximate sample size required to estimate T, using $M\bar{y}$, with a bound, E, on the error of estimation:

$$n = \frac{N \sigma_c^2}{N D + \sigma_c^2} \tag{10.14}$$

where $\sigma_c^2$ is estimated by $s_c^2$ and $D = E^2 / (4 N^2)$.

The approximate sample size required to estimate T, using $N\bar{y}_t$, with a bound, E, on the error of estimation:

$$n = \frac{N \sigma_t^2}{N D + \sigma_t^2} \tag{10.17}$$

where $\sigma_t^2$ is estimated by $s_t^2 = \sum_{i=1}^{n} (y_i - \bar{y}_t)^2 / (n-1)$ and $D = E^2 / (4 N^2)$.

The estimator of the population proportion P is given by:

$$p = \frac{\sum_{i=1}^{n} a_i}{\sum_{i=1}^{n} m_i} \tag{10.18}$$

The estimated variance of p is:

$$\hat{V}(p) = \left(\frac{N-n}{N n \bar{M}^2}\right) \frac{\sum_{i=1}^{n} (a_i - p\, m_i)^2}{n-1} \tag{10.19}$$

where $a_i$ denotes the total number of elements in cluster i that possess the characteristic of interest.

One-Stage Cluster Sampling Problems

1.
A manufacturer of band saws wants to estimate the average repair cost per month for the saws he has sold to certain industries. He cannot obtain a repair cost for each saw, but he can obtain the total amount spent for saw repairs and the number of saws owned by each industry. Thus, he decides to use cluster sampling with each industry as a cluster. The manufacturer selects a simple random sample of n = 20 from the N = 96 industries which he services. The data on total cost of repairs per industry and number of saws per industry are as follows:

    Industry     Number     Total repair cost for
    (cluster)    of saws    past month (dollars)
        1           3              50
        2           7             110
        3          11             230
        4           9             140
        5           2              60
        6          12             280
        7          14             240
        8           3              45
        9           5              60
       10           9             230
       11           8             140
       12           6             130
       13           3              70
       14           2              50
       15           1              10
       16           4              60
       17          12             280
       18           6             150
       19           5             110
       20           8             120
     Total        130           2,565

Estimate the average repair cost per saw for the past month, and place a bound on the error of estimation.
Answer: Mean = 19.73077; Error = 1.96 s(mean) = 1.78

2. For the data in Exercise 1, estimate the total amount spent by the 96 industries on band saw repairs. Place a bound on the error of estimation.
Answer: Total = 12312; Error = 1.96 s(Total) = 3175.067

3. After checking his sales records, the manufacturer of Exercise 1 finds that he sold a total of 710 band saws to these industries. Using this additional information, estimate the total amount spent on saw repairs by these industries and place a bound on the error of estimation.
Answer: Total = 14,008.846; Error = 1,110.7845

4. The same manufacturer wants to estimate the average repair cost per saw for next month. How many clusters should he select for his sample if he wants the bound on the error of estimation to be less than $2.00?
Answer: n = 14

5. A political scientist developed a test designed to measure the degree of awareness of current events. He wants to estimate the average score which would be achieved on this test by all students in a certain high school.
The administration at the school would not allow the experimenter to randomly select students out of classes in session, but it would allow him to interrupt a small number of classes for the purpose of giving the test to every member of the class. Thus, the experimenter selects 25 classes at random from the 108 classes in session at a particular hour. The test is given to each member of the sampled classes with the following results:

    Class   Number of   Total      Class   Number of   Total
            students    score              students    score
      1        31        1590       14        40        1980
      2        29        1510       15        38        1990
      3        25        1490       16        28        1420
      4        35        1610       17        17         900
      5        15         800       18        22        1080
      6        31        1720       19        41        2010
      7        22        1310       20        32        1740
      8        27        1427       21        35        1750
      9        25        1290       22        19         890
     10        19         860       23        29        1470
     11        30        1620       24        18         910
     12        18         710       25        31        1740
     13        21        1140

Estimate the average score that would be achieved on this test by all students in the school. Place a bound on the error of estimation.
Answer: Mean = 51.56; Error = 1.344

6. The same political scientist of Exercise 5 wants to estimate the average test score for a similar high school. If he wants the bound on the error of estimation to be less than 2 points, how many classes should he sample? Assume the school has 100 classes in session during each hour.
Answer: n = 13

7. An industry is considering revision of its retirement policy and wants to estimate the proportion of employees who favor the new policy. The industry consists of 87 separate plants located throughout the United States. Since results must be obtained quickly and with little cost, the industry decides to use cluster sampling with each plant as a cluster. A simple random sample of 15 plants is selected, and the opinions of the employees in these plants are obtained by questionnaire.
The results are as follows:

    Plant   Number of   Number favoring     Plant   Number of   Number favoring
            employees   new policy                  employees   new policy
      1        51            42               9        73            54
      2        62            53              10        61            45
      3        49            40              11        58            51
      4        73            45              12        52            29
      5       101            63              13        65            46
      6        48            31              14        49            37
      7        65            38              15        55            42
      8        49            30

Estimate the proportion of employees in the industry who favor the new retirement policy and place a bound on the error of estimation.
Answer: Proportion = .709; Error = .048

8. The industry of Exercise 7 modified its retirement policy after obtaining the results of the survey. It now wants to estimate the proportion of employees in favor of the modified policy. How large a sample should be taken to have a bound of 0.08 on the error of estimation? Use the data from Exercise 7 to approximate the results of the new survey.
Answer: n = 7

9. An economic survey is designed to estimate the average amount spent on utilities for households in a city. Since no list of households is available, cluster sampling is used with divisions (wards) forming the clusters. A simple random sample of 20 wards is selected from the 60 wards of the city. Interviewers then obtain the cost of utilities from each household within the sampled wards; the total costs are tabulated below:

    Sampled   Number of    Total amount spent     Sampled   Number of    Total amount spent
    ward      households   on utilities           ward      households   on utilities
      1          55            2210                 11         73            2930
      2          60            2390                 12         64            2470
      3          63            2430                 13         69            2830
      4          58            2380                 14         58            2370
      5          71            2760                 15         63            2390
      6          78            3110                 16         75            2870
      7          69            2780                 17         78            3210
      8          58            2370                 18         51            2430
      9          52            1990                 19         67            2730
     10          71            2810                 20         70            2880

Estimate the average amount a household in the city spends on utilities, and place a bound on the error of estimation.
Answer: Mean = 40.1688; Error = .6406

10. In the above survey the number of households in the city is not known. Estimate the total amount spent on utilities for all households in the city, and place a bound on the error of estimation.
Answer: Total = 157,020; Error = 6927.875

11.
The economic survey of Exercise 9 is to be performed in a neighboring city of similar structure. The objective is to estimate the total amount spent on utilities by households in the city with a bound of $5,000 on the error of estimation. Use the data in Exercise 9 to find the approximate sample size needed to achieve this bound.
Answer: n = 30

12. An inspector wants to estimate the average weight of fill for cereal boxes packaged in a certain factory. The cereal is available to him in cartons containing 12 boxes each. The inspector randomly selects 5 cartons and measures the weight of fill for every box in the sampled cartons, with the following results (in ounces):

Carton 1 2 3 4 5 Ounces of fill 16.1 15.9 16.2 15.9 16.0 15.9 16.2 16.0 16.1 15.8 16.1 15.8 15.7 16.2 16.3 16.2 16.0 16.3 16.1 15.7 15.9 16.3 15.8 16.1 16.1 15.8 16.1 16.0 16.3 15.9 16.1 15.8 15.9 15.9 16.0 16.2 15.9 16.0 16.1 16.1 16.0 16.0 16.1 15.9 15.8 15.9 16.1 16.0 15.9 16.0 15.8 16.1 15.9 16.0 16.1 16.0 15.9 16.1 16.0 15.9

Estimate the average weight of fill for boxes packaged by this factory, and place a bound on the error of estimation. Assume that the total number of cartons packaged by the factory is large enough for the finite population correction to be ignored.
Answer: Mean = 16.005; Error = 0.0215

13. A newspaper wants to estimate the proportion of voters favoring a certain candidate, "Candidate A," in a state-wide election. Since it is very expensive to select and interview a simple random sample of registered voters, cluster sampling is used with precincts as clusters. A simple random sample of 50 precincts is selected from the 497 precincts in the state. The newspaper wants to make the estimation on election day, but before final returns are tallied. Therefore, reporters are sent to the polls of each sample precinct to obtain the pertinent information directly from the voters. The results are tabulated below:

    No. of    Number        No. of    Number        No. of    Number
    voters    favoring A    voters    favoring A    voters    favoring A
     1290        680         1893       1143          843        321
     1170        631         1942       1187         1066        487
      840        475          971        542         1171        596
     1620        935         1143        973         1213        782
     1381        472         2041       1541         1741        980
     1492        820         2530       1679          983        693
     1785        933         1567        982         1865       1033
     2010       1171         1493        863         1888        987
      974        542         1271        742         1947        872
      832        457         1873       1010         2021       1093
     1247        983         2142       1092         2001       1461
     1896       1462         2380       1242         1493       1301
     1943        873         1693        973         1783       1167
      798        372         1661        652         1461        932
     1020        621         1555        523         1237        481
     1141        642         1492        831         1843        999
     1820        975         1957        932

Estimate the proportion of voters favoring Candidate A, and place a bound on the error of estimation.
Answer: Proportion = 0.57; Error = 0.0307

14. The same newspaper wants to conduct a similar survey during the next election. How large a sample size will be needed to estimate the proportion of voters favoring a similar candidate with a bound of 0.05 on the error of estimation? Use the data in Exercise 13.
Answer: n = 21

15. A forester wishes to estimate the average height of trees on a plantation. The plantation is divided into quarter-acre plots. A simple random sample of 20 plots is selected from the 386 plots on the plantation. All trees on the sampled plots are measured with the following results:

    Number     Average height     Number     Average height
    of trees   (feet)             of trees   (feet)
       42          6.2               60          6.3
       51          5.8               52          6.7
       49          6.7               61          5.9
       55          4.9               49          6.1
       47          5.2               57          6.0
       58          6.9               63          4.9
       43          4.3               45          5.3
       59          5.2               46          6.7
       48          5.7               62          6.1
       41          6.1               58          7.0

Estimate the average height of trees on the plantation, and place a bound on the error of estimation. (Hint: the total for cluster i can be found by taking the total number of elements in cluster i times the cluster average.)
Answer: Mean = 5.91; Error = 0.3224

16. To emphasize safety, a taxi-cab company wants to estimate the proportion of unsafe tires on their 175 cabs. (Ignore spare tires.) It is impractical to select a simple random sample of tires, so cluster sampling is used with each cab as a cluster.
A random sample of 25 cabs gives the following number of unsafe tires per cab: 2, 4, 0, 1, 2, 0, 4, 1, 3, 1, 2, 0, 1, 1, 2, 2, 4, 1, 0, 0, 3, 1, 2, 2, 1. Estimate the proportion of unsafe tires being used on the company's cabs, and place a bound on the error of estimation.
Answer: Proportion = 0.4; Error = 0.1165

CHAPTER 11
CLUSTER SAMPLING VARIANCES

11.1 VARIANCE OF A TWO-STAGE CLUSTER SAMPLE

To study the variance of a two-stage cluster sample, it will be useful to review some ideas of stratified sampling. In stratified sampling, the standard error of a sample estimate depends on the within-stratum variances, $S_i^2$. For each stratum, the variance $S_i^2$ is defined by the same formula as S² (the total variance of the population) but using only the elements in the ith stratum. We saw that stratified sampling was most useful when the means of the strata were very different. In fact the gains of stratified sampling can be determined by computing the standard deviation among the means of the strata (that is, computing the standard deviation of the stratum means $\bar{Y}_i$, weighted by the number of units within each stratum) if the necessary data are available. When the sets are clusters (primary sampling units or PSUs) rather than strata, the square of this weighted standard deviation between the means is called the between-PSU variance.

Similar concepts can be considered in cluster sampling. In fact, there is a close analogy between cluster and stratified sampling. In both cases we group the individual elements into sets before selecting the sample. The difference is that in stratified sampling it is necessary to sample within every one of the sets (the strata); in cluster sampling a sample of the sets (the clusters) is selected and then either all or a sample of the elements within the selected sets is included. The purpose and method of forming the sets is very different in the two cases.
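The split of total variation into a part between sets and a part within sets, which underlies the analogy above, can be checked numerically. The following is a minimal sketch on hypothetical data (the values are made up for illustration, and the function name is introduced here); it uses the standard identity that the size-weighted sum of squares between group means plus the sum of squares within groups equals the total sum of squares.

```python
def sums_of_squares(groups):
    """Return (between, within, total) sums of squares for grouped values.

    'between' weights each squared deviation of a group mean from the
    overall mean by the number of units in the group.
    """
    all_values = [x for g in groups for x in g]
    grand_mean = sum(all_values) / len(all_values)
    between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    total = sum((x - grand_mean) ** 2 for x in all_values)
    return between, within, total


# Hypothetical sets (strata or clusters) with very different means.
groups = [[2, 3, 4], [10, 11, 12], [20, 21, 22, 23]]
between, within, total = sums_of_squares(groups)
print(f"between = {between:.1f}, within = {within:.1f}, total = {total:.1f}")
```

When the sets are strata, a large between component is an advantage, because every set is sampled and only the within component contributes to the sampling error. When the sets are clusters and only some are selected, the between component enters the sampling error directly.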
11.1.1 Notation

Consider a two-stage design in which second-stage sample units (SSUs) are selected randomly from the elementary units within selected clusters (primary sampling units or PSUs) for interview.

N = total number of PSUs (first-stage clusters) in the population
n = number of selected sample PSUs
$M = \sum_{i=1}^{N} M_i$ = total number of elementary units (second-stage units) in the population
$m = \sum_{i=1}^{n} m_i$ = total number of second-stage units (SSUs) in the sample
$M_i$ = number of SSUs in the ith PSU (cluster), where i = 1, ..., N
$\bar{M} = M/N$ = average number of SSUs per PSU in the population, or average cluster size
$m_i$ = number of SSUs selected for the sample in the ith PSU, i = 1, ..., n
$\bar{m} = m/n$ = average number of SSUs per sample PSU
$Y_{ij}$ = value of a characteristic for the jth elementary unit in the ith PSU in the population
$Y_i = \sum_{j=1}^{M_i} Y_{ij}$ = total value of the characteristic in the ith PSU in the population
$Y = \sum_{i=1}^{N} Y_i$ = total value of the characteristic in the population
$\bar{Y}_i = Y_i / M_i$ = average value of the population characteristic in the ith PSU
$\bar{Y} = Y/N$ = average value of the characteristic per PSU (cluster) in the population
$\bar{\bar{Y}} = Y/M$ = population mean per unit
$y_{ij}$ = value of the characteristic for the jth sample SSU in the ith sample PSU
$y_i = \sum_{j=1}^{m_i} y_{ij}$ = total value of the characteristic in the ith sample PSU
$\bar{y}_i = y_i / m_i$ = sample average of the characteristic in the ith sample PSU
$S_b^2$ = population variance between PSU totals
$S_{wi}^2$ = within-PSU variance in the ith PSU (for the population)

11.1.2 Estimates of Means and Totals

The formulas given in previous chapters for estimating population means are appropriate when the sampling unit is identical with the unit of analysis. An important characteristic of cluster sampling, however, is that the sampling unit (at least in the first stage) is not the unit of analysis. Thus, in the examples in the previous chapter, we would probably not be interested in the mean per family, per school, per factory, or per block. Rather, we would be interested in estimating the mean per family member, per school child, per factory worker, or per housing unit.
Consider a two-stage design in which the second-stage units are the units of analysis; n clusters are selected from among N clusters by simple random sampling; and $m_i$ units are selected in the ith PSU using simple random sampling for i = 1, ..., n. Within the ith cluster, the population mean per unit is given by

$$\bar{Y}_i = \frac{Y_i}{M_i} = \frac{1}{M_i}\sum_{j=1}^{M_i} Y_{ij} \tag{11.1}$$

Since the units within the cluster were selected by simple random sampling, we know (from chapter 4, section 2) that we can estimate this mean without bias by using the following formula:

$$\bar{y}_i = \frac{1}{m_i}\sum_{j=1}^{m_i} y_{ij} \tag{11.2}$$

These estimates of the cluster unit means from the n sample clusters must then be combined in some way to estimate the overall population total (Y) and the population mean per unit, given by the following formula:

$$\bar{\bar{Y}} = \frac{Y}{M}$$

Several estimators are available and are discussed in most standard texts; we shall examine only one of these. First, we shall construct an estimator for the population total for the Y-characteristic. An unbiased estimator for $Y_i$, the ith PSU total, is given by

$$\hat{Y}_i = M_i \bar{y}_i \tag{11.3}$$

An unbiased estimator for the population total is then given by

$$\hat{Y} = \frac{N}{n}\sum_{i=1}^{n} M_i \bar{y}_i \tag{11.4}$$

Similarly, we can estimate the total number of units of analysis in the population (assuming that we do not know it) by

$$\hat{M} = \frac{N}{n}\sum_{i=1}^{n} M_i \tag{11.5}$$

The population mean per unit is $\bar{\bar{Y}} = Y/M$. An estimator of $\bar{\bar{Y}}$ is

$$\bar{\bar{y}} = \frac{\hat{Y}}{\hat{M}} = \frac{\sum_{i=1}^{n} M_i \bar{y}_i}{\sum_{i=1}^{n} M_i} \tag{11.6}$$

As can be seen, this estimator is a weighted mean of the n sample cluster means per unit, where the weights are the corresponding cluster sizes. As indicated previously, this is only one of several possible estimators; however, this estimator seems to be most generally useful. Since both the numerator and denominator are random variables, this is a ratio-type estimator and it has the usual bias of a ratio estimator. The bias will usually not be serious if the number of clusters in the sample is reasonably large.

11.1.3 Variances

Consider the case when n PSUs are selected from a population of N PSUs and random samples of $m_i$ (i = 1,...,n) SSUs are taken from the $M_i$ (i = 1,...,N) SSUs in the selected PSUs.
Then the variance of $\hat{Y}$, the estimator of Y, is (11.7) where, (11.8) and (11.9)

The variance of the estimator of Y is the sum of two components. The first component is the contribution to the variance arising from the selection of first-stage units. The second component is the contribution from the selection of second-stage units. If there are three or more stages of sampling, the variance will include additional terms, similar in form, for each additional stage.

The sample estimator of the variance of $\hat{Y}$ is (11.10) This estimator is unbiased for the variance of $\hat{Y}$ although $s_b^2$ is not unbiased for $S_b^2$. Now, (11.11) and, (11.12)

The variance of the estimator of $\bar{\bar{Y}}$ is more complex. It is given approximately by: The approximate value of the variance of the estimator of $\bar{\bar{Y}}$ may also be obtained from equation (11.7) as follows: (11.13) (11.14) and is estimated using (11.15)

If all PSUs have the same number of second-stage units M and a constant number m of them is sampled from every sample PSU, we have $M_i = \bar{M}$ and $m_i = \bar{m}$ (11.16) In this case, the variance of $\hat{Y}$ is (11.17) where, and (11.18) The sample estimate of the variance of $\hat{Y}$ is given by: (11.19) where, (11.20) The variance of an estimated mean is (11.21) and is estimated using (11.22)

11.1.3.1 Illustration

A population consists of four clusters of five households each. The second-stage units, which are also the elementary units in this case, are households having persons as follows:

                 Cluster
Household     1     2     3     4
    1         3     8     4     7
    2        10     3     6     2
    3         9     6     3     6
    4         8     4     8     4
    5         6     5     6     6

First, select two clusters at random from the population of four clusters. Then within each of these selected clusters take a random sample of three households. Compute the estimate of the population total Y and the variance of the estimate of the total. Find the variance and the coefficient of variation of the estimate of the mean number of persons per household.

In this case, N = 4, n = 2, $M_i = \bar{M} = 5$, and $m_i = \bar{m} = 3$. Suppose that clusters 3 and 4 are selected at random. Assume also that households 1, 2, and 5 within cluster 4 and households 2, 4, and 5 within cluster 3 are selected at random.
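The estimates for this illustration can be checked numerically. The sketch below is a minimal Python example, assuming the usual unbiased two-stage estimator of the total (each sample PSU mean is expanded by $M_i$, then the sum by N/n) and the standard two-component variance estimator with finite population corrections at both stages. The function name `two_stage_estimates` and the variable names are hypothetical and are not taken from the manual.

```python
from math import sqrt
from statistics import mean, variance


def two_stage_estimates(N, sizes, samples):
    """Estimate a population total from a two-stage SRS/SRS cluster sample.

    N       -- number of PSUs in the population
    sizes   -- M_i, the number of SSUs in each sample PSU
    samples -- list of lists: the observed values y_ij in each sample PSU
    Returns (estimated total, estimated variance of that total).
    """
    n = len(samples)
    # Expanded PSU totals: M_i * ybar_i
    psu_totals = [M * mean(y) for M, y in zip(sizes, samples)]
    total = N / n * sum(psu_totals)
    # First-stage component: variability between estimated PSU totals
    between = N**2 * (1 - n / N) * variance(psu_totals) / n
    # Second-stage component: variability within each sample PSU
    within = N / n * sum(
        M**2 * (1 - len(y) / M) * variance(y) / len(y)
        for M, y in zip(sizes, samples)
    )
    return total, between + within


# Illustration data: cluster 3 (households 2, 4, 5) and cluster 4 (households 1, 2, 5)
N, M_bar = 4, 5
samples = [[6, 8, 6], [7, 2, 6]]
total, var_total = two_stage_estimates(N, [M_bar, M_bar], samples)

# Mean per household, using the known number of households N * M_bar = 20
mean_per_hh = total / (N * M_bar)
se_mean = sqrt(var_total) / (N * M_bar)
cv = se_mean / mean_per_hh

print(round(total, 2), round(var_total, 2))                   # 116.67 194.44
print(round(mean_per_hh, 2), round(se_mean, 3), round(cv, 4))  # 5.83 0.697 0.1195
```

For comparison, the true population total in this small example is 114 persons, so this particular sample overestimates it slightly.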
Then we have $\bar{y}_3 = (6 + 8 + 6)/3 = 6.67$ and $\bar{y}_4 = (7 + 2 + 6)/3 = 5.00$, so the estimated cluster totals are $5 \times 6.67 = 33.33$ and $5 \times 5.00 = 25.00$, and

$$\hat{Y} = \frac{N}{n}\sum_{i} M_i \bar{y}_i = \frac{4}{2}(33.33 + 25.00) = 116.67$$

Using equation (11.19), where the sample variance between the estimated cluster totals is $s_b^2 = 34.72$ and the sample variances within clusters 3 and 4 are $s_{w3}^2 = 1.33$ and $s_{w4}^2 = 7.00$, we have, on substitution,

$$v(\hat{Y}) = 4^2\left(1 - \tfrac{2}{4}\right)\frac{34.72}{2} + \frac{4}{2}\,5^2\left(1 - \tfrac{3}{5}\right)\frac{1.33 + 7.00}{3} = 138.89 + 55.56 = 194.44$$

The average number of persons per household is estimated by: $\bar{\bar{y}} = 116.67/20 = 5.83$

The estimated variance of this estimate is: $v(\bar{\bar{y}}) = 194.44/20^2 = 0.4861$

The standard error of $\bar{\bar{y}}$ is: $\sqrt{0.4861} = 0.697$

and the coefficient of variation of $\bar{\bar{y}}$ is: $0.697/5.83 = 0.1195$, or about 12 percent.

11.1.4 Random Group Method of Approximating Variances

The above formulas are somewhat cumbersome. Consequently, short-cut approximations are often used to reduce the amount of work, particularly if variance estimates are to be computed for a large number of characteristics. One of these approximations is known as the random group method.

The random group method consists of dividing the sample into a number of groups at random; each group is then used to make an estimate of the total, mean, etc. (this would be done for each characteristic for which a variance is to be computed). Each of the random groups will reflect the various steps of the sample selection, so that the estimate from each group is an estimate of the total with the same sample design as the whole sample (but with a much smaller sample size). In a multistage sample, the random groups are usually formed by placing the entire sample from a primary sampling unit in a single group. For complex designs using stratification and/or sampling over time, somewhat different methods are available to divide the sample into random groups. However, the method is not very useful if the number of first-stage units is small.

In computing the estimates of variance, it is exactly the variance between different possible estimates of the total or mean in which we are interested. Therefore, this method, which provides a number of different estimates of the total or mean, each with some degree of stability (that is, the number of cases in a group should not be too small), is a realistic one for estimating variances.
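The random group idea can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes equal-sized groups, keeps the whole sample from each PSU in one group as described above, and uses the common textbook form in which the variance of the overall estimate is the variance among the k group estimates divided by k (no finite population correction). The function name, the seed, and the data are hypothetical.

```python
import random
from statistics import mean


def random_group_variance(psu_estimates, N, k, seed=0):
    """Random group variance estimate for an estimated total.

    psu_estimates -- expanded totals (M_i * ybar_i) for the n sample PSUs
    N             -- number of PSUs in the population
    k             -- number of random groups (n must be a multiple of k here)
    """
    n = len(psu_estimates)
    if n % k != 0:
        raise ValueError("this sketch needs n to be a multiple of k")
    shuffled = list(psu_estimates)
    random.Random(seed).shuffle(shuffled)      # assign PSUs to groups at random
    size = n // k
    groups = [shuffled[g * size:(g + 1) * size] for g in range(k)]
    # Each group estimates the total as if it were the whole sample
    group_totals = [N / size * sum(g) for g in groups]
    overall = mean(group_totals)
    var = sum((t - overall) ** 2 for t in group_totals) / (k * (k - 1))
    return overall, var


# Hypothetical expanded totals for 12 sample PSUs drawn from N = 60 PSUs
psu_estimates = [33, 25, 41, 29, 37, 22, 30, 35, 28, 39, 26, 31]
overall, var = random_group_variance(psu_estimates, N=60, k=4)
print(overall, var ** 0.5)
```

With equal group sizes, the average of the group estimates equals the full-sample estimate (N/n times the sum of the PSU estimates), so only the variance depends on how the PSUs happen to be grouped.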
11.2 LIMITING FORMS OF VARIANCE OF TWO-STAGE SAMPLE

Examining the variance equation (11.7) and equation (11.9), we can easily see what happens in two simple situations. First, if all second-stage units are included in the sample we have the case described in chapter 9 as "single-stage cluster sampling." In this case, $m_i = M_i$ and the term arising from variation within first-stage units is zero. In equation (11.7), the first term is the same as the variance formula for simple random sampling except that the sample sizes and values of $Y_i$ refer to the first-stage units. For example, if area segments were the first-stage units, N is the total number of area segments and $Y_i$ is the segment total for the variable. In equation (11.9), the first term is the between-cluster component of the overall variance, which is based on the differences among cluster means per unit of analysis rather than on differences among cluster totals.

Secondly, consider a situation in which all first-stage units are in the sample. In this case, n = N and the first term becomes zero. The variance of the estimator of the population total becomes equal to

$$V(\hat{Y}) = \sum_{i=1}^{N} M_i^2\left(1 - \frac{m_i}{M_i}\right)\frac{S_{wi}^2}{m_i}$$

The variance of the estimator of the population mean per element is then equal to

$$V(\bar{\bar{y}}) = \frac{1}{M^2}\sum_{i=1}^{N} M_i^2\left(1 - \frac{m_i}{M_i}\right)\frac{S_{wi}^2}{m_i}$$

These are the variance formulas for the estimators of totals and means from a stratified sample. In other words, a stratified sample is simply a special case of a cluster sample in which all first-stage units are included in the sample and a subsample of second-stage units is selected from each first-stage unit.

This discussion has covered only the case of simple random sampling for both the first-stage and second-stage selections. Analogous formulas can be developed for stratified cluster sampling, in which the only difference is that the terms in the equations are replaced by the sums of similar terms over strata.
11.3 ANALYSIS OF COMPONENTS OF VARIANCE

A more detailed analysis of equation (11.7) and equation (11.13) would show that for a two-stage sample containing a given total number of units of analysis, the sampling variances of estimates computed by equation (11.4) and equation (11.6) depend on several factors. Two important factors which the sampling statistician must consider in designing the sample are:

(1) The variability in size of first-stage units in terms of the number of second-stage units they contain.
(2) The variability among second-stage units (the elementary units or units of analysis) within first-stage units.

11.3.1 Variability in Size of First-Stage Units

If the first-stage units are unequal in size in terms of the number of second-stage units (for example, the number of holdings in an area segment), these variations in size can have a profound effect on the size of the variance of the estimator of the population total, as shown by the first term in equation (11.7). We can see in equation (11.13) that the variance of the estimator of the population mean per elementary unit is affected by the variation among first-stage means per element. If the variability in size is very great, it will be necessary to use a large sample of first-stage units or to change the sampling and estimating methods to keep the standard error within reasonable bounds (see section 11.4 below).

11.3.2 Variability Among Second-Stage Units

The second important factor is the variability among second-stage units (units of analysis) within first-stage units (clusters). For a given sampling plan in which we select n out of N clusters and an average of $\bar{m}$ units of analysis out of each sample cluster, it can be shown that the greater the variability among second-stage units within first-stage units, the smaller will be the sampling variability of resulting estimates. In other words, it is desirable that the units of analysis have a relatively low intraclass correlation.
Intraclass correlation is a measure of similarity among units within a cluster with regard to the characteristics being investigated. A mathematical demonstration of this phenomenon is beyond the scope of this chapter; however, by considering an extreme example we can gain an intuitive understanding of it. Consider a situation in which the units of analysis within each cluster are identical. Clearly, a sampling plan such as described above would not be efficient. A single unit of analysis within a given cluster would provide complete information about all the units; consequently, the remaining units would contribute nothing additional to our knowledge. To include them in the sample would be a waste of resources. The inefficiency of this design in this situation would be reflected in a high sampling variability relative to a simple random sample with the same number of units of analysis. The statistician must consider the effect of intraclass correlation on the sampling variability when designing a sample. This is particularly true of area sampling since units which are close together geographically are usually quite similar for many characteristics such as income, education, attitudes, type of agricultural activity, etc. The usual approach is to limit the number of units of analysis taken from the first-stage units and include more of the first-stage units in the sample. In a single-stage sample, the statistician can do this by making the clusters as small as practicable. The more common approach, however, is to introduce additional stages in the sampling procedure so that the number of units of analysis ultimately selected from each unit at the last stage is small. The statistician must, of course, balance precision against cost in deciding on a sampling plan. Notice that in cluster sampling we gain by having units within clusters as unlike as possible, but in stratified sampling we gain by having units within strata as much alike as possible. 
The reason for this difference becomes clear when you recall from section 11.2 above that in stratified sampling, the "between-cluster" component of the variance drops out of the equation entirely.

11.4 CONTROL OF VARIABILITY IN SIZE OF CLUSTER

In all of this discussion, it has been assumed that the only way we could affect the sampling variance, with the given population, is to take more or fewer sample cases in the first or second stages or to vary the size of the first-stage units. Of course, if the sampling variance can be reduced by appropriate stratification, this should be done first. Several special procedures are also available to control the effect of variability in size of cluster. The most important procedures are described below.

Although this discussion is related to a two-stage sample, a similar analysis could be made for three or more stages. The procedures described below for controlling variability in size apply equally well to first, second, or other stages, whenever cluster sampling is used.

11.4.1 Define Clusters of Equal Size

One obvious method is to attempt to define clusters in such a way that they are approximately equal in size in terms of the number of units of analysis, with the expectation that this will tend to make them equal also in terms of the characteristics being investigated. If this can be done with available materials and information, then no other action is necessary. For example, if block counts of numbers of housing units are available for cities and villages, it may be possible to group small blocks together to make clusters which contain approximately the same number of housing units.

In some cases, it may be feasible to define clusters directly in terms of a characteristic being investigated. For example, in an agricultural survey, clusters can be constructed to be nearly equal in area. If recent aerial photographs are available, they might even be made nearly equal in terms of cultivated area.
11.4.2 Stratify Clusters by Size

If information is available on the size of all the first-stage clusters in the universe in advance of the survey (reasonably good approximations are adequate), it is possible to stratify the clusters by size group. The effect of stratification is to replace a total variance by a sum of within-stratum variances. Within each stratum, the clusters should be about equal in size; therefore, stratification by size of cluster will have about the same effect as making all clusters in the whole population about equal in size.

If information on size is not available, it may be worthwhile to spend a small amount of the available resources, for example, in making a "Quick Count" of city blocks in order to obtain approximate sizes of the first-stage units (in terms of the number of housing units they contain). Errors in counts do not cause biases in the estimates, which are based on the actual numbers of housing units found in the survey itself.

Either optimum or proportionate sampling can be performed, depending on which appears most useful in the particular case. If more than one characteristic is being estimated, proportionate sampling may be preferable to optimum allocation, since the optimum allocation might be different for each characteristic. Also, proportionate sampling is usually safer unless very good measures of size are available, since the use of the optimum allocation formula with poor measures of size may actually increase the variance.

11.4.3 Use of Ratio Estimates

A third method of reducing the effect of variability in cluster size is through the use of ratio estimates. Ratio estimates were discussed in detail in Chapter 9; an example of the method is given here. A ratio estimate makes use of a quantity of the form $\hat{Y}/\hat{X}$, where both $\hat{Y}$ and $\hat{X}$ are estimates of totals made from sample data.
X, the universe total of the quantity of which $\hat{X}$ is an estimate, must be known (it may be a projection or other figure which is believed to be very close to the true value). One can make a ratio estimate of the universe total Y (an estimate which is frequently very efficient) by using $(\hat{Y}/\hat{X})X$ instead of $\hat{Y}$ alone. The new estimate of Y thus differs materially from $\hat{Y}$, since it involves two items having sampling variances instead of one. Ratio estimates are generally much less sensitive to variation in size of cluster than estimates of the type $\hat{Y}$, and their use will frequently reduce the standard errors appreciably.

11.4.3.1 Ratio to Approximate Number of Units of Analysis

Two different uses of ratio estimates for this purpose will be discussed. In the first, "X" is a variable closely related to the total number of units of analysis in the clusters and $\hat{X}$ is an estimate, based on the sample clusters only, of the population aggregate, X.

For example, consider a sample design in which city blocks are the first-stage units, and housing units are both the second-stage units and the units of analysis. We have rough counts ($X_i$) of housing units for each block based on a previous census or special counts made for this purpose. These counts can be totaled for all blocks in the city to obtain X. Then a sample estimate of X, $\hat{X}$, can be obtained by adding up the rough counts for the sample blocks only, and multiplying this by N/n (where N is the total number of blocks in the city and n is the number in the sample). Then, a ratio estimate of Y is

$$\hat{Y}_R = \frac{\hat{Y}}{\hat{X}}\,X$$

If subsampling is used within the first-stage units, the procedure would be modified. In order to make the fullest gain with this type of ratio estimate, it is advisable not to subsample independently within the clusters, but to treat the second-stage units within the clusters as a continuous list and sample systematically throughout.
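As a small numerical sketch of this first use of the ratio estimate, the Python example below assumes a simple random sample of blocks with no subsampling, so that both $\hat{Y}$ and $\hat{X}$ are obtained by expanding sample sums by N/n. All figures and the function name `ratio_estimate_total` are hypothetical and chosen only for illustration.

```python
def ratio_estimate_total(y_sample, x_sample, N, X_known):
    """Return (Y_hat, X_hat, Y_ratio) for a simple random sample of clusters.

    y_sample -- survey totals observed in each sample block
    x_sample -- rough advance counts for the same sample blocks
    N        -- number of blocks in the population
    X_known  -- sum of the rough counts over all N blocks
    """
    n = len(y_sample)
    y_hat = N / n * sum(y_sample)      # simple expansion estimate of Y
    x_hat = N / n * sum(x_sample)      # same expansion applied to the rough counts
    return y_hat, x_hat, y_hat / x_hat * X_known


# Hypothetical city of N = 40 blocks; 5 sample blocks
x_sample = [20, 35, 50, 12, 28]        # rough housing-unit counts
y_sample = [64, 101, 160, 40, 85]      # persons found in the survey
y_hat, x_hat, y_ratio = ratio_estimate_total(y_sample, x_sample, N=40, X_known=1200)
print(y_hat, x_hat, round(y_ratio, 1))  # 3600.0 1160.0 3724.1
```

Because the expansion factor N/n cancels in the ratio, the ratio estimate depends on the sample only through the ratio of the two sample sums. This is why it is less sensitive than $\hat{Y}$ to the sample happening to contain unusually large or unusually small blocks.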
11.4.3.2 Ratio to a Correlated Statistic

In a second use of ratio estimates, the true value of some universe total X is known and a sample estimate $\hat{X}$ (of X) can be obtained in the survey. If the characteristics "Y" and "X" are positively correlated, then $(\hat{Y}/\hat{X})X$ will reduce the effect of variability in cluster size (and possibly other types of variability as well).

For example, suppose a survey is planned to measure the total wage and salary earnings of factory workers (Y). We can do this by taking a sample of factories (the clusters) and including all workers within the sample of factories. Suppose the total sales of all factories (X) can be found from some other source, such as tax records. We could then include on our questionnaire to the sample factories a question on total sales ($x_i$) as well as wage and salary payments ($y_i$), and we could prepare estimates of population totals for both characteristics from the sample in the usual manner. The ratio estimate of wages and salaries would then be $(\hat{Y}/\hat{X})X$.

11.4.4 Use of Probability Proportionate to Size

A fourth method for controlling the effects of variability in cluster size is to select the sample clusters with probability proportionate to size instead of using a simple random sample of clusters. Probability proportionate to size is frequently abbreviated as PPS. Selection with PPS means that a cluster which is, for example, 5 times as large as another will have 5 times the chance to be in the sample. It might appear, at first, that this would introduce a bias in the sample result, with some clusters overrepresented and others underrepresented. When PPS is used, the unbiased estimate of the total, where there is no subsampling, is

$$\hat{Y} = \sum_{i=1}^{n} \frac{Y_i}{P_i}$$

Here $Y_i$ is the total in the ith cluster in the sample and $P_i$ is the probability of selection of this cluster. It can easily be shown that this provides an unbiased estimate of Y.
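The unbiasedness claim can be verified directly in the simplest case. The Python sketch below assumes a single cluster is drawn (n = 1) with probability $P_i = M_i/M$, so that the estimate is $Y_i/P_i$, and it enumerates every possible outcome exactly using fractions. For comparison it does the same for an equal-probability draw with the estimate $N Y_i$. The cluster sizes and totals are hypothetical.

```python
from fractions import Fraction

# Hypothetical population of 5 clusters: sizes M_i and cluster totals Y_i
sizes = [50, 12, 20, 31, 10]
totals = [155, 30, 66, 90, 34]
M, Y, N = sum(sizes), sum(totals), len(sizes)

# PPS draw of one cluster: P_i = M_i / M, estimate = Y_i / P_i
p = [Fraction(s, M) for s in sizes]
pps_values = [Fraction(y) / pi for y, pi in zip(totals, p)]
pps_mean = sum(pi * v for pi, v in zip(p, pps_values))           # expected value
pps_var = sum(pi * (v - Y) ** 2 for pi, v in zip(p, pps_values))

# Equal-probability draw of one cluster: estimate = N * Y_i
srs_values = [N * y for y in totals]
srs_mean = Fraction(sum(srs_values), N)
srs_var = Fraction(sum((v - Y) ** 2 for v in srs_values), N)

print(pps_mean == Y, srs_mean == Y)     # both estimators are unbiased
print(float(pps_var), float(srs_var))   # PPS variance is far smaller here
```

Both estimators average exactly to the true total Y over all possible draws. In this example the cluster totals are roughly proportional to the cluster sizes, so the values $Y_i/P_i$ vary much less from cluster to cluster than the values $N Y_i$, which is the source of the gain from PPS selection.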
11.4.4.1 Two-stage Sampling

A common application of sampling with PPS is the use of PPS for the selection of the first-stage units in a two-stage sample. When this is done, the subsampling rates are usually set as inversely proportional to size. As a result, the chance of any second-stage unit being included in the sample is the product of the probability of the first-stage and second-stage selections. All second-stage units therefore have identical probabilities and the sample is self-weighting. There are a number of other advantages to this type of selection procedure; for example, the workload can be made approximately the same for all selected first-stage units. Moreover, the estimates will have smaller variances than those from a proportionate sample in which the first-stage units are selected with equal probabilities.

11.4.4.2 Measures of Size

In order to select with PPS, it is necessary to have measures of size of each cluster in the population. If measures of size are not available, it will usually be found worth the effort to prepare crude estimates of size. (Rough approximations will be almost as effective as more exact measures.) Let us assume such measures are available. The mechanics for selecting a sample with PPS can best be described through an illustration.

11.4.4.3 Illustration

Suppose the clusters are blocks and we wish to sample the housing units in a universe made up of the 10 blocks as listed in column (1) of Table 11.1. We would list, in column (2), the measure of size for each block (this may be a rough estimate of the number of housing units), and cumulate the measures of size in column (3). The last figure in column (3) is the total number (rough estimate) of housing units in all 10 blocks. Let us assume that we wish to include in the sample 5 blocks out of the 10, and that the sample is to include 10 percent of all the housing units.
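The selection for this illustration, which is tabulated and described step by step in Table 11.1, can be reproduced with a short Python sketch. It assumes systematic selection along the cumulated measures of size, with each selection number assigned to the first block whose cumulative measure is at least as large as that number. The function name `pps_systematic` is hypothetical; the measures of size, the random start of 22.5, and the overall rate of 1/10 are those used in the illustration.

```python
from bisect import bisect_left
from itertools import accumulate


def pps_systematic(sizes, n, random_start):
    """Systematic PPS selection of n clusters from a list of size measures."""
    cumulative = list(accumulate(sizes))
    interval = cumulative[-1] / n                       # sampling interval
    picks = [random_start + k * interval for k in range(n)]
    # A selection number falls in the first block whose cumulative measure reaches it
    blocks = [bisect_left(cumulative, p) + 1 for p in picks]
    return interval, picks, blocks


sizes = [50, 12, 20, 31, 10, 60, 55, 13, 30, 20]       # column (2) of Table 11.1
interval, picks, blocks = pps_systematic(sizes, n=5, random_start=22.5)
print(blocks)                                           # [1, 4, 6, 7, 9]

overall_rate = 0.10                                     # 10 percent of housing units
rates = {}
for b in blocks:
    p_select = sizes[b - 1] / interval                  # column (5)
    rates[b] = overall_rate / p_select                  # column (6)
    print(b, round(p_select, 4), round(rates[b], 4))
```

The product of the two columns is 1/10 for every selected block, which is what makes the sample self-weighting. A block whose measure of size exceeds the sampling interval can be hit more than once; in this sketch it would simply appear more than once in `blocks`.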
Table 11.1 SELECTION OF SAMPLE BLOCKS

Column (1): Block number (PSU)
Column (2): Measure of size
Column (3): Cumulative measure
Column (4): Sample designation
Column (5): Probability of selection, $P_i = n_h (M_{hi}/M_h)$
Column (6): Within-cluster sampling rate, $m_{hi}/M_{hi}$

 (1)    (2)    (3)          (4)      (5)          (6)
  1     50       0 - 50      22.5    50 ÷ 60.2    60.2 ÷ 500
  2     12      51 - 62
  3     20      63 - 82
  4     31      83 - 113     82.7    31 ÷ 60.2    60.2 ÷ 310
  5     10     114 - 123
  6     60     124 - 183    142.9    60 ÷ 60.2    60.2 ÷ 600
  7     55     184 - 238    203.1    55 ÷ 60.2    60.2 ÷ 550
  8     13     239 - 251
  9     30     252 - 281    263.3    30 ÷ 60.2    60.2 ÷ 300
 10     20     282 - 301

After completing the first three columns of Table 11.1 as shown, proceed as follows:

(1) Since there are 5 blocks in the sample, divide the final cumulative measure (301) by 5; this gives 60.2, which is the "sampling interval" for selecting blocks.

(2) Choose a random number between 0.1 and 60.2; suppose the number happens to be 22.5. This number is called the Random Start (RS).

(3) Use this random number as the starting number and enter it in column (4), on the line whose cumulative measure interval includes the number 22.5. In our example, the cumulative measure interval is [0 - 50].

(4) Add the sampling interval (60.2) to the random start (22.5). The result is 82.7; enter 82.7 on the line whose cumulative measure interval contains this number. In our case, the interval is [83 - 113], since 82.7 lies beyond the cumulative measure of 82 that closes the preceding block. Continue adding 60.2 to the last number obtained (82.7 in our case) to obtain the next one: 142.9. Locate the interval which contains 142.9. In our case the interval is [124 - 183]. Continue with this procedure until a number is reached which is larger than the last cumulative measure.

(5) The blocks with entries in column (4) are the ones in the sample. In this example, they are blocks 1, 4, 6, 7, and 9.

(6) The probability ($P_i$) of selection of each block actually selected is entered in column (5). For each block, the probability is the measure of size in column (2) divided by the sampling interval 60.2.
(7) The sampling rate to be used within each selected block is computed and entered in column (6). For each block, the rate is the desired overall probability of selection, namely 1/10, divided by the entry in column (5). Thus, for block 1, the rate in column (6) would be (1/10) ÷ (50 ÷ 60.2), or 60.2 ÷ 500.

(8) It occasionally happens that some of the blocks are so large that the measures of size are greater than the sampling interval. As a result, there may be two or more entries in column (4) for the same block. In such a case, the subsampling rate within the block is adjusted to make the overall probability for the selection of housing units equal to 1/10, in our example.

Two-Stage Cluster Sampling

1. A nurseryman wants to estimate the average height of seedlings in a large field that is divided into 50 plots that vary slightly in size. He believes the heights are fairly constant throughout each plot, but may vary considerably from plot to plot. Therefore, it is decided to sample 10% of the trees within each of 10 plots using a two-stage cluster sample. The data are as follows:

Plot   Number of    Number of    Height of seedlings (inches)
       seedlings    seedlings
                    sampled
  1       52           5         12, 11, 12, 10, 13
  2       56           6         10, 9, 7, 9, 8, 10
  3       60           6         6, 5, 7, 5, 6, 4
  4       46           5         7, 8, 7, 7, 6
  5       49           5         10, 11, 13, 12, 12
  6       51           5         14, 15, 13, 12, 13
  7       50           5         6, 7, 6, 8, 7
  8       61           6         9, 10, 8, 9, 9, 10
  9       60           6         7, 10, 8, 9, 9, 10
 10       45           6         12, 11, 12, 13, 12, 12

Estimate the average height of seedlings in the field, and place a bound on the error of estimation.
Answer: Mean = 9.3789; Error = 1.4536

2. In Exercise 1, assume that the nurseryman knows there are approximately 2600 seedlings in the field. Use this additional information to estimate the average height, and place a bound on the error of estimation.
Answer: Mean = 9.5593; Error = 1.3672

3. A supermarket chain has stores in 32 cities. A company official wants to estimate the proportion of stores in the chain which do not meet a specified cleanliness criterion.
Stores within each city appear to possess similar characteristics; therefore, it is decided to select a two-stage cluster sample containing one-half of the stores within each of four cities. Cluster sampling is desirable in this situation because of travel costs. The data collected are as follows:

City   Number of stores   Number of stores   Number of stores not
       in city            sampled            meeting criterion
 1         25                 13                  3
 2         10                  5                  1
 3         18                  9                  4
 4         16                  8                  2

Estimate the proportion of stores not meeting the cleanliness criterion, and place a bound on the error of estimation.
Answer: Proportion = .2865; Error = .1116

4. Repeat Exercise 3 given that the chain contains 450 stores.
Answer: Proportion = .351; Error = .1767

5. To improve telephone service, an executive of a certain company wants to estimate the total number of phone calls placed by secretaries in the company during one day. The company contains 12 departments, each making approximately the same number of calls per day. Each department employs approximately 20 secretaries, and the number of calls made varies considerably from secretary to secretary. It is decided to employ two-stage cluster sampling using a small number of departments (clusters) and selecting a fairly large number of secretaries (elements) from each. Ten secretaries are sampled from each of four departments. The data are summarized in the following table:

Department   Number of     Number of secretaries   Mean    Variance
             secretaries   sampled
    1            21              10                15.5      2.8
    2            23              10                15.8      3.1
    3            20              10                17.0      3.5
    4            20              10                14.9      3.4

Estimate the total number of calls placed by the secretaries in this company, and place a bound on the error of estimation.
Answer: Total = 3,980.7; Error = 274.7317

6. A city zoning commission wants to estimate the proportion of property owners in a certain section of a city who favor a proposed zoning change. The section is divided into 7 distinct residential areas, each containing similar residents.
Because the results must be obtained in a short period of time, two-stage cluster sampling is used. Three of the 7 areas are selected at random and 20% of the property owners in each area selected are sampled. The figure of 20% seems reasonable because the people living within each area seem to be in the same socioeconomic class and, hence, they tend to hold similar opinions on the zoning question. The results are as follows:

Area   Number of         Number of property   Number in favor of
       property owners   owners sampled       zoning change
 1          46                  9                   1
 2          67                 13                   2
 3          93                 20                   2

Estimate the proportion of property owners who favor the proposed zoning change, and place a bound on the error of estimation.
Answer: Proportion = .1200; Error = .0667

7. A forester wants to estimate the total number of trees in a certain county which are infected with a particular disease. There are ten well-defined forest areas in the county; these areas can be subdivided into plots of approximately the same size. Four crews are available to conduct the survey, which must be completed in one day. Hence, two-stage cluster sampling is used. Four areas (clusters) are chosen with 6 plots (elements) randomly selected from each. (Each crew can survey six plots in one day.) The data are as follows:

Area   Number of   Number of plots   Number of infected trees per plot
       plots       sampled
 1        12            6            15, 14, 21, 13, 9, 10
 2        15            6            4, 6, 10, 9, 8, 5
 3        14            6            10, 11, 14, 10, 9, 15
 4        21            6            8, 3, 4, 1, 2, 5

Estimate the total number of infected trees in the county, and place a bound on the error of estimation.
Answer: Total = 1,276.2425; Error = 333.4435

8. A new bottling machine is being tested by a company. During a test run, the machine fills 24 cases, each containing a dozen bottles. It is desired to estimate the average number of ounces of fill per bottle. A two-stage cluster sample is employed using 6 cases (clusters) with 4 bottles (elements) randomly selected from each.
The results are as follows:

Case   Average ounces of fill       Sample variance
       for sample ($\bar{y}_i$)     ($s_i^2$)
 1          7.9                        0.15
 2          8.0                        0.12
 3          7.8                        0.09
 4          7.9                        0.11
 5          8.1                        0.10
 6          7.9                        0.12

Estimate the average number of ounces per bottle, and place a bound on the error of estimation.
Answer: Mean = 7.9333; Error = 0.0924

9. A population consists of four clusters. The second-stage units, which are also the elementary units in this case, are houses having rental values as follows:

          Cluster 1   Cluster 2   Cluster 3   Cluster 4
            $100        $100        $10         $50
             100         100         20          90
             200                     40
             400                     50
TOTALS       800         200        120         140

a. What is the value of $S_b^2$ (the between-cluster variance)?
b. What is the value of the within-cluster variance for the first cluster?
c. A sample of two clusters is selected with equal probability; within each selected cluster, half the elementary units are in the sample.
(i) How would you compute the estimate of Y?
(ii) What is the variance of the sample estimate of the total?
(iii) What is the probability that any elementary unit will be in the sample?
(iv) Compute the coefficient of variation for the estimate.

10. The following table shows areas of cacao holdings of 15 farmers in five clusters (PSUs) of equal size. The five clusters were selected at random from a total of 40 clusters into which the territory had been divided. Each PSU represents a geographic division containing 120 cacao farmers.

Area of Cacao Holdings

          Cluster 1   Cluster 2   Cluster 3   Cluster 4   Cluster 5
              96         110         102         140         132
             134         121         113         142         162
             152         146         157         161         184
TOTALS       382         377         372         443         478

a. Estimate the total area of cacao for the territory.
b. Estimate the average area of cacao per farm.
c. Compute the standard errors for the estimates given in exercises (a) and (b).
d. Compute the coefficient of variation for the estimates given in exercises (a) and (b).

11. Assume a city with 12 blocks, as listed in the first column below. Measures of size (approximate number of housing units in each block) are given in the second column.
On the basis of this information, we wish to select a sample of 4 blocks with probability proportionate to size, and then to select housing units within the blocks in order to obtain a self-weighting sample of an expected 10 housing units.

Block      Approximate number   Cumulative   Actual number   Serial numbers
number     of housing units     measure      of housing      of actual
(PSU)      (measure of size)                 units*          housing units
   1             10                10             9            1 to 9
   2              5                15             6           10 to 15
   3              2                17             2           16 to 17
   4              5                22             6           18 to 23
   5              5                27             6           24 to 29
   6             10                37             8           30 to 37
   7             10                47             8           38 to 45
   8              2                49             2           46 to 47
   9              2                51             4           48 to 51
  10              5                56             6           52 to 57
  11              5                61             6           58 to 63
  12             10                71             9           64 to 72
TOTALS           71                               72

* The number of housing units that would actually be found in the block in a field operation if the block were selected in the sample.

a. Prepare a worksheet showing the selection of the sample of blocks. Assume 3.7 is the random start number for designating the sample blocks.
b. Assume that you have visited the blocks selected in your sample and determined the actual number of housing units as given in the fourth column above. The housing units that actually exist in each block are designated by "Serial Numbers" as shown in the fifth column. Perform the necessary computations for selecting the sample of housing units and list the Serial Numbers for the housing units selected in your sample.

12. Consider the list of 600 households of 30 villages located in 3 zones (See Appendix IV). Using a two-stage cluster sample design, it is desired to estimate the total number of persons in the population. A random sample of four clusters is chosen and five households in each sampled cluster are randomly selected. Assume households and consider the village as the cluster (PSU) for the survey.

a. Estimate the total number of persons in the population.
b. Compute the standard error for the estimated total.
c. Determine the coefficient of variation for the estimated total.
d. Construct a 95 percent confidence interval for the population total and interpret the result.
CHAPTER 12 NONRESPONSE

The best way to deal with nonresponse is to prevent it. After nonresponse has occurred, it is sometimes possible to model the missing data, but predicting the missing observations is never as good as observing them in the first place. Nonrespondents often differ in critical ways from respondents; if the nonresponse rate is not negligible, inference based upon only the respondents may be seriously flawed.

We discuss two types of nonresponse in this chapter: unit nonresponse, in which the entire observation unit is missing, and item nonresponse, in which some measurements are present for the observation unit but at least one item is missing. In a survey of persons, unit nonresponse means that the person provides no information for the survey; item nonresponse means that the person does not respond to a particular item on the questionnaire. In the Current Population Survey and the National Crime Victimization Survey (NCVS), unit nonresponse can arise for a variety of reasons: the interviewer may not be able to contact the household; the person may be ill and unable to respond to the survey; the person may refuse to participate in the survey. In these surveys, the interviewer tries to get demographic information about the nonrespondent, such as age, sex, and race, as well as characteristics of the dwelling unit, such as urban/rural status; this information can be used later to adjust for the nonresponse. Item nonresponse occurs largely because of refusals: a household may decline to give information about income, for example.

In agriculture or wildlife surveys, the term missing data is generally used instead of nonresponse, but the concepts and remedies are similar. In a survey of breeding ducks, for example, some birds will not be found by the researchers; they are, in a sense, nonrespondents. The nest may be raided by predators before the investigator can determine how many eggs were laid; this is comparable to item nonresponse.
In this chapter, we discuss four approaches to dealing with nonresponse:

1. Prevent it. Design the survey so that nonresponse is low. This is by far the best method.
2. Take a representative subsample of the nonrespondents; use that subsample to make inferences about the other nonrespondents.
3. Use a model to predict values for the nonrespondents. Weights implicitly use a model to adjust for unit nonresponse. Imputation often adjusts for item nonresponse, and parametric models may be used for either type of nonresponse.
4. Ignore the nonresponse (not recommended, but unfortunately common in practice).

12.1 Effects of Ignoring Nonresponse

Example 12.1 Thomas and Siring (1983) report results from a 1969 survey on voting behavior carried out by the Central Bureau of Statistics in Norway. In this survey, three calls were followed by a mail survey. The final nonresponse rate was 9.9%, which is often considered to be a small nonresponse rate. Did the nonrespondents differ from the respondents? In the Norwegian voting register, it was possible to find out whether a person voted in the election. The percentage of persons who voted could then be compared for respondents and nonrespondents; Table 12.1 shows the results. The selected sample is all persons selected to be in the sample, including data from the Norwegian voting register for both respondents and nonrespondents. The difference in voting rate between the nonrespondents and the selected sample was largest in the younger age groups. Among the nonrespondents, the voting rate varied with the type of nonresponse. The overall voting rate for the persons who refused to participate in the survey was 81%, the voting rate for the not-at-homes was 65%, and the voting rate for the mentally and physically ill was 55%, implying that absence or illness were the primary causes of nonresponse bias.
Table 12.1 Percentage of Persons Who Voted

Age                All    20-24   25-29   30-49   50-69   70-79
Nonrespondents     71      59      56      72      78      74
Selected Sample    88      81      84      90      91      84

It has been demonstrated repeatedly that nonresponse can have large effects on the results of a survey: in Example 12.1, a nonresponse rate of less than 10% led to an overestimate of the voting rate in Norway. Holt and Elliot discuss the results of a series of studies done on nonresponse in the United Kingdom, indicating that "lower response rates are associated with the following characteristics: London residents; households with no car; single people; childless couples; older people; divorced/widowed people; new Commonwealth origin; lower educational attainment; self-employed" (1991, 334).

Moreover, increasing the sample size without targeting nonresponse does nothing to reduce nonresponse bias; a larger sample size merely provides more observations from the class of persons that would respond to the survey. Increasing the sample size may actually worsen the nonresponse bias, as the larger sample size may divert resources that could have been used to reduce or remedy the nonresponse, or it may result in less care in the data collection. Recall that the infamous Literary Digest Survey of 1936 (see Annex 1) had 2.4 million respondents but a response rate of less than 25%. The U.S. decennial census itself does not include the entire population, and the undercoverage rate varies for different demographic groups. In the early 1990s, the nonresponse and undercoverage in the U.S. Census prompted a lawsuit from certain cities to force the Census Bureau to adjust for the nonresponse, and the debate about census adjustment continues.

Most small surveys ignore any nonresponse that remains after callbacks and follow-ups, and report results based on complete records only. Hite (1987) did so in her survey, and much of the criticism of her results was based on her low response rate.
Nonresponse is also ignored for many surveys reported in newspapers, both local and national. An analysis of complete records has the underlying assumption that the nonrespondents are similar to the respondents and that units with missing items are similar to units that have responses for every question. Much evidence indicates that this assumption does not hold true in practice. If nonresponse is ignored in the NCVS, for example, victimization rates are underestimated. Biderman and Cantor (1984) find lower victimization rates for persons who respond in three consecutive interviews than for persons who are nonrespondents in at least one of those interviews or who move before the panel study is completed.

Results reported from an analysis of only complete records should be taken as representative of the population of persons who would respond to the survey, which is rarely the same as the target population. If you insist on estimating population means and totals using only the complete records and making no adjustment for nonrespondents, at the very least you should report the rate of nonresponse.

The main problem caused by nonresponse is potential bias of population estimates. Think of the population as being divided into two somewhat artificial strata of respondents and nonrespondents. The population respondents are the units that would respond if they were chosen to be in the sample; the number of population respondents, $N_R$, is unknown. Similarly, the $N_M$ (M for missing) population nonrespondents are the units that would not respond. We then have the following population quantities:

Stratum              Size     Total    Mean             Variance
Respondents          $N_R$    $T_R$    $\bar{y}_{RU}$   $S_R^2$
Nonrespondents       $N_M$    $T_M$    $\bar{y}_{MU}$   $S_M^2$
Entire Population    $N$      $T$      $\bar{y}_U$      $S^2$

The population as a whole has variance $S^2$, with mean $\bar{y}_U$ and total $T$. A probability sample from the population will likely contain some respondents and some nonrespondents. But, of course, on the first call we do not observe $y_i$ for any of the units in the nonrespondent stratum.
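To get a feel for how much the split into respondent and nonrespondent strata can matter, the figures of Example 12.1 can be plugged into the identity "overall mean = respondent share times respondent mean + nonrespondent share times nonrespondent mean". The sketch below is only an illustration: it treats the selected sample as if it were the population, and the function name `respondent_only_bias` is hypothetical.

```python
def respondent_only_bias(nonresponse_rate, mean_all, mean_nonrespondents):
    """Back out the respondent mean and the bias of a respondents-only mean.

    Uses the two-stratum identity
        mean_all = (1 - r) * mean_respondents + r * mean_nonrespondents,
    where r is the nonresponse rate (assumption: the selected sample
    stands in for the population).
    """
    r = nonresponse_rate
    mean_respondents = (mean_all - r * mean_nonrespondents) / (1.0 - r)
    bias = mean_respondents - mean_all
    return mean_respondents, bias


# Example 12.1: 9.9% nonresponse, 88% voted overall, 71% of nonrespondents voted.
mean_resp, bias = respondent_only_bias(0.099, 88.0, 71.0)
print(f"respondent voting rate: {mean_resp:.2f}%")
print(f"overestimate from using respondents only: {bias:.2f} percentage points")
```

Under these assumptions, even a nonresponse rate below 10% shifts the respondents-only voting rate upward by almost two percentage points.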
If the population mean in the nonrespondent stratum differs from that in the respondent stratum, estimating the population mean using only the respondents will produce bias (see footnote 1). Let $\bar{y}_R$ be an approximately unbiased estimator of the mean in the respondent stratum, using only the respondents, and let $\bar{y}_{RU}$ and $\bar{y}_{MU}$ denote the population means in the respondent and nonrespondent strata. Because

$$\bar{y}_U = \frac{N_R}{N}\,\bar{y}_{RU} + \frac{N_M}{N}\,\bar{y}_{MU},$$

the bias is approximately

$$E[\bar{y}_R] - \bar{y}_U \approx \frac{N_M}{N}\left(\bar{y}_{RU} - \bar{y}_{MU}\right).$$

The bias is small if either (1) the mean for the nonrespondents is close to the mean for the respondents or (2) $N_M/N$ is small, that is, there is little nonresponse. But we can never be assured of (1), as we generally have no data for the nonrespondents. Minimizing the nonresponse rate is the only sure way to control nonresponse bias.

12.2 Designing Surveys to Reduce Nonsampling Errors

A common feature of poor surveys is a lack of time spent on the design and nonresponse follow-up in the survey. Many persons new to surveys (and some, unfortunately, not new) simply jump in and start collecting data without considering potential problems in the data-collection process; they mail questionnaires to everyone in the target population and analyze those that are returned. It is not surprising that such surveys have poor response rates. Many surveys reported in academic journals on purchasing, for example, have response rates between 10 and 15%. It is difficult to see how anything can be concluded about the population in such a survey.

A researcher who knows the target population well will be able to anticipate some of the reasons for nonresponse and prevent some of it. Most investigators, however, do not know as much about reasons for nonresponse as they think they do. They need to discover why the nonresponse occurs and resolve as many of the problems as possible before commencing the survey. These reasons can be discovered through designed experiments and application of quality-improvement methods to the data collection and processing. You do not know why previous surveys related to yours have a low response rate? Design an experiment to find out.
You think errors are introduced in the data recording and processing? Use a nested design to find the sources of errors. Any book on quality control or designed experiments will tell you how to collect your data. And, of course, you can rely on previous researchers' experiments to help you minimize nonsampling errors. The references on experiment design and quality control at the end of the book are a good place to start; Hidiroglou et al. (1993) give a general framework for nonresponse.

Footnote 1: The variance is often too low as well. In income surveys, for example, the rich and the poor are more likely to be nonrespondents on the income question. In that case, $S_R^2$, for the respondent stratum, is smaller than $S^2$. The point estimate of the mean may be biased, and the variance estimate may be biased, too.

Example 12.2 The 1990 U.S. decennial census attempted to survey each of the over 100 million households in the United States. The response rate for the mail survey was 65%; households that did not mail in the survey needed to be contacted in person, adding millions of dollars to the cost of the census. Increasing the mail response rate for future censuses would result in tremendous savings. Dillman et al. (1995a) report results of a factorial experiment employed in the 1992 Census Implementation Test, designed to explore the individual effects and interactions of three experimental factors on response rates. The three factors were: (1) a prenotice letter alerting the household to the impending arrival of the census form, (2) a stamped return envelope included with the census form, and (3) a reminder postcard sent a few days after the census form. The results were dramatic, as shown in Figure 12.1. The experiment established that, although all three factors influenced the response rate, the letter and postcard led to greater gains in response rate than the stamped return envelope.
Figure 12.1 Response rates achieved for each combination of the factors letter, envelope, and postcard. The observed response rate was 64.3% when all three aids were used and only 50% when none were used.

Nonresponse can have many different causes; as a result, no single method can be recommended for every survey. Platek (1977) classifies sources of nonresponse as related to (1) survey content, (2) methods of data collection, and (3) respondent characteristics, and illustrates various sources using the diagram in Figure 12.2. Groves (1989) and Dillman (1978) discuss additional sources of nonresponse.

Figure 12.2 Factors Affecting Nonresponse

The following are some factors that may influence response rate and data accuracy.

# Survey content. A survey on drug use or financial matters may have a large number of refusals. Sometimes the response rate can be increased for sensitive items by careful ordering of the questions or by using a randomized response technique (see Section 12.5).

# Time of survey. Some calling periods or seasons of the year may yield higher response rates than others. The vacation month of August, for example, would be a bad time to take a one-time household survey in Germany.

# Interviewers. Grower (1979) found a large variability in response rates achieved by different interviewers, with about 15% of interviewers reporting almost no nonresponse. Some field investigators in a bird survey may be better at spotting and identifying birds than others. Standard quality-improvement methods can be applied to increase the response rate and accuracy for interviewers. The same methods can be applied to the data-coding process.

# Data-collection method. Generally, telephone and mail surveys have a lower response rate than in-person surveys (they also have lower costs, however).
Computer Assisted Telephone Interviewing (CATI) has been demonstrated to improve accuracy of data collected in telephone surveys; with CATI, all questions are displayed on a computer, and the interviewer codes the responses in the computer as questions are asked. CATI is especially helpful in surveys in which a respondent's answer to one question determines which question is asked next (Catlin and Ingram 1988). Mail, fax, and Internet surveys often have low response rates. Possible reasons for nonresponse in a mail survey should be explored before the questionnaire is mailed: Is the survey sent to the wrong address? Do recipients discard the envelope as junk mail even before opening it? Will the survey reach the intended recipient? Will the recipient believe that filling out the survey is worth the time?

# Questionnaire design. We have already seen that question wording has a large effect on the responses received; it can also affect whether a person responds to an item on the questionnaire. The volume edited by Tanur (1993) explores some recent research on the application of cognitive research to question design. In a mail survey, a well-designed form for the respondent may increase data accuracy.

# Respondent burden. Persons who respond to a survey are doing you an immense favor, and the survey should be as nonintrusive as possible. A shorter questionnaire, requiring less detail, may reduce the burden to the respondent. Respondent burden is a special concern in panel surveys such as the NCVS, in which sampled households are interviewed every six months for 3 ½ years. DeVries et al. (1996) discuss methods used in reducing respondent burden because a smaller sample suffices to give the required precision.

# Survey introduction. The survey introduction provides the first contact between the interviewer and potential respondent; a good introduction, giving the recipient motivation to respond, can increase response rates dramatically.
Nielsen Media Research emphasizes to households in its selected sample that their participation in the Nielsen ratings affects which television shows are aired. The respondent should be told for what purpose the data will be used (unscrupulous persons often pretend to be taking a survey when they are really trying to attract customers or converts) and assured of confidentiality.

# Incentives and disincentives. Incentives, financial or otherwise, may increase the response rate. Disincentives may work as well: Physicians who refused to be assessed by peers after selection in a stratified sample from the College of Physicians and Surgeons of Ontario registry had their medical licenses suspended. Not surprisingly, nonresponse was low (McAuley et al. 1990).

# Follow-up. The initial contact of the sample is usually less costly per unit than follow-ups of the initial nonrespondents. If the initial survey is by mail, a reminder may increase the response rate. Not everyone responds to follow-up calls, though; some persons will refuse to respond to the survey no matter how often they are contacted. You need to decide how many follow-up calls to make before the marginal returns no longer justify the money spent.

You should try to obtain at least some information about nonrespondents that can be used later to adjust for the nonresponse, and include surrogate items that can be used for item nonresponse. True, there is no complete compensation for not having the data, but partial information may be better than none. Information about the race, sex, or age of a nonrespondent may be used later to adjust for nonresponse. Questions about income may well lead to refusals, but questions about cars, employment, or education may be answered and can be used to predict income. If the pretests of the survey indicate a nonresponse problem that you do not know how to prevent, try to design the survey so that at least some information is collected for each observation unit.
The quality of survey data is largely determined at the design stage. Fisher’s (1938) words about experiments apply equally well to the design of sample surveys: “To call in the statistician after the experiment is done may be no more than asking him to perform a postmortem examination: he may be able to say what the experiment died of.” Any survey budget needs to allocate sufficient resources for survey design and for nonresponse follow-up. Do not scrimp on the survey design; every hour spent on design may save weeks of remorse later. 12.3 Callbacks and Two-Phase Sampling Virtually all good surveys rely on callbacks to obtain responses from persons not at home for the first try. Analysis of callback data can provide some information about the biases that can be expected from the remaining nonrespondents. Example 12.3 Traugott (1987) analyzed callback data from two 1984 Michigan polls on preference for presidential candidates. The overall response rates for the surveys were about 65%, typical for large political polls. About 21% of the interviewed sample responded on the first call; up to 30 attempts were made to reach persons who did not respond on the first call. Traugott found that later respondents were more likely to be male, older, and Republican than early respondents; while 48% of the respondents who answered the first call supported Reagan and 45% supported Mondale, 59% of the entire sample supported Reagan as opposed to 39% for Mondale. Differing procedures for nonresponse follow-up and persistence in callback may explain some of the inconsistencies among political polls. If nonrespondents resemble late respondents, one might speculate that nonrespondents were more likely to favor Reagan. 
But nonrespondents do not necessarily resemble the hard-to-reach; persons who absolutely refuse to participate may differ greatly from persons who could not be contacted immediately, and nonrespondents may be more likely to have illnesses or other circumstances preventing participation. We also do not know how likely it is that nonrespondents to the surveys will vote in the election; even if we speculate that they were more likely to favor Reagan, they are not necessarily more likely to vote for Reagan.

Often, when the survey is designed so that callbacks will be used, the initial contact is by mail survey; the follow-up calls use a more expensive method such as a personal interview. Hansen and Hurwitz (1946) proposed subsampling the nonrespondents and using two-phase sampling (also called double sampling) for stratification to estimate the population mean or total. The population is divided into two strata, as described in Section 12.1; the two strata are respondents and initial nonrespondents, persons who do not respond in the first call. We will develop the theory of two-phase sampling for general survey designs in Section 12.1; here, we illustrate how it can be used for nonresponse.

In the simplest form of two-phase sampling, randomly select $n$ units in the population. Of these, $n_R$ respond and $n_M$ do not respond. The values $n_R$ and $n_M$, though, are random variables; they will change if a different simple random sample (SRS) is selected. Then, make a second call on a random subsample of $100\nu\%$ of the $n_M$ nonrespondents in the sample, where the subsampling fraction $\nu$ does not depend on the data collected. Suppose that through some superhuman effort all the targeted nonrespondents are reached. Let $\bar{y}_R$ be the sample average of the original respondents and $\bar{y}_M$ (M stands for missing) be the average of the subsampled nonrespondents.
The two-phase sampling estimates of the population mean and total are:

$$\hat{\bar{y}} = \frac{n_R}{n}\,\bar{y}_R + \frac{n_M}{n}\,\bar{y}_M \tag{12.1}$$

and

$$\hat{T} = N\hat{\bar{y}} = \frac{N}{n}\sum_{i \in S_R} y_i + \frac{N}{n}\,\frac{1}{\nu}\sum_{i \in S_M} y_i, \tag{12.2}$$

where $S_R$ represents the sampled units in the respondent stratum and $S_M$ represents the sampled units in the nonrespondent stratum. Note that $\hat{T}$ is a weighted sum of the observed units; the weights are $N/n$ for the respondents and $N/(n\nu)$ for the subsampled nonrespondents. Because only a subsample was taken in the nonrespondent stratum, each subsampled unit in that stratum represents more units in the population than does a unit in the respondent stratum.

The expected value and variance of these estimators are found in Section 12.1. Because $\hat{T}$ is an appropriately weighted unequal-probability estimator, Theorem 6.2 implies that $E[\hat{T}] = T$. From (12.5), if the finite population corrections can be ignored, we can estimate the variance by

$$\hat{V}(\hat{\bar{y}}) = \frac{n_R - 1}{n - 1}\,\frac{s_R^2}{n} + \frac{n_M - 1}{n - 1}\,\frac{s_M^2}{\nu n} + \frac{1}{n - 1}\left[\frac{n_R}{n}\left(\bar{y}_R - \hat{\bar{y}}\right)^2 + \frac{n_M}{n}\left(\bar{y}_M - \hat{\bar{y}}\right)^2\right],$$

where $s_R^2$ and $s_M^2$ are the sample variances of the respondents and of the subsampled nonrespondents. If everyone responds in the subsample, two-phase sampling not only removes the nonresponse bias but also accounts for the original nonresponse in the estimated variance.

12.4 Mechanisms for Nonresponse

Most surveys have some residual nonresponse even after careful design and follow-up of nonresponse. All methods for fixing up nonresponse are necessarily model-based. If we are to make any inferences about the nonrespondents, we must assume that they are related to respondents in some way. A good nontechnical reference for methods of dealing with nonresponse is Groves (1989); the three-volume set edited by Madow et al. (1983) contains much information on the statistical research on nonresponse up to that date.

Dividing population members into two fixed strata of would-be respondents and would-be nonrespondents is fine for thinking about potential nonresponse bias and for two-phase methods. To adjust for nonresponse that remains after all other measures have been taken, we need a more elaborate setup, letting the response or nonresponse of unit $i$ be a random variable.
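Returning briefly to the two-phase estimators of Section 12.3, the sketch below computes the estimated mean and total from a first-call respondent group and a subsample of the nonrespondents, and checks that the weighted-sum form (weights N/n and N/(nν)) gives the same total. All numbers are invented for illustration, and the function name `two_phase_estimates` is hypothetical.

```python
from statistics import mean


def two_phase_estimates(N, n, y_resp, n_M, y_sub_nonresp):
    """Two-phase (double sampling) estimates of the mean and total.

    N: population size; n: initial SRS size
    y_resp: values for the n_R first-call respondents
    n_M: number of first-call nonrespondents (n_R + n_M must equal n)
    y_sub_nonresp: values for the subsampled nonrespondents, all of whom
        are assumed to respond on the second call
    """
    n_R = len(y_resp)
    if n_R + n_M != n:
        raise ValueError("n_R + n_M must equal n")
    nu = len(y_sub_nonresp) / n_M          # realized subsampling fraction
    ybar_hat = (n_R / n) * mean(y_resp) + (n_M / n) * mean(y_sub_nonresp)
    t_hat = N * ybar_hat
    # Same total written as a weighted sum of the observed units.
    t_weighted = (N / n) * sum(y_resp) + (N / (n * nu)) * sum(y_sub_nonresp)
    return ybar_hat, t_hat, t_weighted


# Invented data: 10 sampled units, 6 respond, 2 of the 4 nonrespondents followed up.
ybar_hat, t_hat, t_weighted = two_phase_estimates(
    N=1000, n=10, y_resp=[4, 5, 6, 5, 4, 6], n_M=4, y_sub_nonresp=[8, 10]
)
print(ybar_hat, t_hat, t_weighted)
```

In this toy data set each respondent carries weight 100 and each followed-up nonrespondent carries weight 200, which is why the subsampled units pull the estimate well above the respondents-only mean of 5.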
Define the random variable

$$R_i = \begin{cases} 1 & \text{if unit } i \text{ responds,} \\ 0 & \text{if unit } i \text{ does not respond.} \end{cases}$$

After sampling, the realizations of the response indicator variable are known for the units selected in the sample. A value for $y_i$ is recorded if $r_i$, the realization of $R_i$, is 1. The probability that a unit selected for the sample will respond, $\phi_i = P(R_i = 1)$, is of course unknown but assumed positive. Rosenbaum and Rubin (1983) call $\phi_i$ the propensity score for the $i$th unit. Suppose that $y_i$ is a response of interest and that $x_i$ is a vector of information known about unit $i$ in the sample. Information used in the survey design is included in $x_i$. We consider three types of missing data, using the Little and Rubin (1987) terminology of nonresponse classification.

Missing Completely at Random

If $\phi_i$ does not depend on $x_i$, $y_i$, or the survey design, the missing data are missing completely at random (MCAR). Such a situation occurs if, for example, someone at the laboratory drops a test tube containing the blood sample of one of the survey participants; there is no reason to think that the dropping of the test tube had anything to do with the white blood cell count (see footnote 2). If data are MCAR, the respondents are representative of the selected sample. Missing data in the NCVS would be MCAR if the probability of nonresponse is completely unrelated to region of the United States, race, sex, age, or any other variable measured for the sample and if the probability of nonresponse is unrelated to any variables about victimization status. Nonrespondents would be essentially selected at random from the sample.

If the response probabilities $\phi_i$ are all equal and the events $\{R_i = 1\}$ are conditionally independent of each other and of the sample-selection process given $n_R$, then the data are MCAR. If an SRS of size $n$ is taken, then under this mechanism the respondents will be a simple random subsample of variable size $n_R$. The sample mean of the respondents, $\bar{y}_R$, is approximately unbiased for the population mean. The MCAR mechanism is implicitly adopted when nonresponse is ignored.
Footnote 2: Even here, though, the suspicious mind can create a scenario in which the nonresponse might be related to quantities of interest: perhaps workers are less likely to drop test tubes that they believe contain HIV.

Missing at Random Given Covariates, or Ignorable Nonresponse

If $\phi_i$ depends on $x_i$ but not on $y_i$, the data are missing at random (MAR); the nonresponse depends only on observed variables. We can successfully model the nonresponse, since we know the values of $x_i$ for all sample units. Persons in the NCVS would be missing at random if the probability of responding to the survey depends on race, sex, and age (all known quantities) but does not vary with victimization experience within each age/race/sex class. This is sometimes termed ignorable nonresponse: ignorable means that a model can explain the nonresponse mechanism and that the nonresponse can be ignored after the model accounts for it, not that the nonresponse can be completely ignored and complete-data methods used.

Nonignorable Nonresponse

If the probability of nonresponse depends on the value of a response variable and cannot be completely explained by values of the x's, then the nonresponse is nonignorable. This is likely the situation for the NCVS: it is suspected that a person who has been victimized by crime is less likely to respond to the survey than a nonvictim, even if they share the values of all known variables such as race, age, and sex. Crime victims may be more likely to move after a victimization and thus not be included in subsequent NCVS interviews. Models can help in this situation, because the nonresponse probability may also depend on known variables, but cannot completely adjust for the nonresponse.

The probabilities of responding, $\phi_i$, are useful for thinking about the type of nonresponse. Unfortunately, they are unknown, so we do not know for sure which type of nonresponse is present.
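The three mechanisms can be contrasted on a toy population in which the response probabilities are known (in practice they never are). The sketch below is an illustration under stated assumptions: the eight y values, the covariate x, and the response probabilities are invented; the expected respondents-only mean is approximated by the ratio sum(phi*y)/sum(phi); and the class adjustment weights each class mean by its population share.

```python
def expected_respondent_mean(y, phi):
    """Approximate expected mean of the respondents: sum(phi*y) / sum(phi)."""
    return sum(p * v for p, v in zip(phi, y)) / sum(phi)


def class_adjusted_mean(y, phi, x):
    """Respondent means computed within classes of x, combined by class share."""
    n_units = len(y)
    total = 0.0
    for c in sorted(set(x)):
        idx = [i for i in range(n_units) if x[i] == c]
        class_mean = expected_respondent_mean(
            [y[i] for i in idx], [phi[i] for i in idx]
        )
        total += len(idx) / n_units * class_mean
    return total


# Invented population: y is the response of interest, x a known covariate.
y = [1, 2, 3, 4, 5, 6, 7, 8]
x = [0, 0, 0, 0, 1, 1, 1, 1]
population_mean = sum(y) / len(y)

mechanisms = {
    "MCAR": [0.5] * len(y),                           # same phi for every unit
    "MAR": [0.8 if xi == 0 else 0.4 for xi in x],     # phi depends on x only
    "nonignorable": [0.9 - 0.1 * yi for yi in y],     # phi depends on y itself
}

results = {}
for name, phi in mechanisms.items():
    unadjusted = expected_respondent_mean(y, phi)
    adjusted = class_adjusted_mean(y, phi, x)
    results[name] = (unadjusted, adjusted)
    print(f"{name:13s} unadjusted={unadjusted:.3f} "
          f"class-adjusted={adjusted:.3f} (population mean {population_mean})")
```

In this toy setting the unadjusted mean is on target only under MCAR, adjusting within classes of x recovers the population mean under MAR, and under nonignorable nonresponse the adjustment reduces the bias without removing it.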
We can sometimes distinguish between MCAR and MAR by fitting a model attempting to predict the observed probabilities of response for subgroups from known covariates; if the coefficients in a logistic regression model are significantly different from zero, the missing data are likely not MCAR. Distinguishing between MAR and nonignorable nonresponse is more difficult. In the next section, we discuss a method for estimating the $\phi_i$'s.

12.5 Weighting Methods for Nonresponse

In previous chapters we have seen how weights can be used in calculating estimates for various sampling schemes (see Sections 4.3, 5.4, and 7.2). The sampling weights are the reciprocals of the probabilities of selection, so an estimate of the population total is

$$\hat{T} = \sum_{i \in S} w_i y_i.$$

For stratification, the weights are $w_i = N_h / n_h$ if unit $i$ is in stratum $h$; for sampling elements with unequal probabilities, $w_i = 1/\pi_i$.

Weights can also be used to adjust for nonresponse. Let $Z_i$ be the indicator variable for presence in the selected sample, with $P(Z_i = 1) = \pi_i$. If $R_i$ is independent of $Z_i$, then the probability that unit $i$ will be measured is

P(unit $i$ selected in sample and responds) $= \pi_i \phi_i$.

The probability of responding, $\phi_i$, is estimated for each unit in the sample, using auxiliary information that is known for all units in the selected sample. The final weight for a respondent is then $1/(\pi_i \hat{\phi}_i)$. Weighting methods assume that the response probabilities can be estimated from variables known for all units; they assume MAR data. References for more information on weighting are Oh and Scheuren (1983) and Holt and Elliot (1991).

12.5.1 Weighting-Class Adjustment

Sampling weights $w_i$ have been interpreted as the number of units in the population represented by unit $i$ of the sample.
Weighting-class methods extend this approach to compensate for nonsampling errors: variables known for all units in the selected sample are used to form weighting-adjustment classes, and it is hoped that respondents and nonrespondents in the same weighting-adjustment class are similar. Weights of respondents in the weighting-adjustment class are increased so that the respondents represent the nonrespondents' share of the population as well as their own.

Example 12.4 Suppose the age is known for every member of the selected sample and that person $i$ in the selected sample has sampling weight $w_i = 1/\pi_i$. Then weighting classes can be formed by dividing the selected sample among different age classes, as Table 12.2 shows. We estimate the response probability for each class by

$$\hat{\phi}_c = \frac{\text{sum of weights for respondents in class } c}{\text{sum of weights for selected sample in class } c}.$$

Then the sampling weight for each respondent in class $c$ is multiplied by $1/\hat{\phi}_c$, the weight factor in Table 12.2. The weight of each respondent with age between 15 and 24, for example, is multiplied by 1.622. Since there was no nonresponse in the over-65 group, their weights are unchanged.

Table 12.2 Illustration of Weighting-Class Adjustment Factors

Age                               15-24    25-34    35-44    45-64    65+      Total
Sample size                       202      220      180      195      203      1000
Respondents                       124      187      162      187      203      863
Sum of weights for sample         30322    33013    27046    29272    30451    150104
Sum of weights for respondents    18693    28143    24371    28138    30451
$\hat{\phi}_c$                    0.6165   0.853    0.9011   0.961    1
Weight factor                     1.622    1.173    1.11     1.04     1

The probability of response is assumed to be the same within each weighting class, with the implication that within a weighting class, the probability of response does not depend on $y$. As mentioned earlier, weighting-class methods assume MAR data. The weight for a respondent in weighting class $c$ is $1/(\pi_i \hat{\phi}_c)$.

To estimate the population total using weighting-class adjustments, let $x_{ci} = 1$ if unit $i$ is in class $c$, and 0 otherwise. Then let the new weight for respondent $i$ be

$$\tilde{w}_i = w_i \sum_c \frac{x_{ci}}{\hat{\phi}_c},$$

where $w_i$ is the sampling weight for unit $i$; that is, $\tilde{w}_i = w_i/\hat{\phi}_c$ if unit $i$ is in class $c$. Assign $\tilde{w}_i = 0$ if unit $i$ is a nonrespondent. Then

$$\hat{T}_{wc} = \sum_{i \in S} \tilde{w}_i y_i \quad \text{and} \quad \hat{\bar{y}}_{wc} = \frac{\hat{T}_{wc}}{\sum_{i \in S} \tilde{w}_i}.$$

In an SRS, for example, if $n_c$ is the number of sample units in class $c$, $n_{cR}$ is the number of respondents in class $c$, and $\bar{y}_{cR}$ is the average for the respondents in class $c$, then $\hat{\phi}_c = n_{cR}/n_c$ and

$$\hat{T}_{wc} = N \sum_c \frac{n_c}{n}\,\bar{y}_{cR}, \qquad \hat{\bar{y}}_{wc} = \sum_c \frac{n_c}{n}\,\bar{y}_{cR}.$$

Example 12.5 The National Crime Victimization Survey. To adjust for individual nonresponse in the NCVS, the within-household noninterview adjustment factor (WHHNAF) of Chapter 7 is used. NCVS interviewers gather demographic information on the nonrespondents, and this information is used to classify all persons into 24 weighting-adjustment cells. The cells depend on the age of the person, the relation of the person to the reference person (head of household), and the race of the reference person. For any cell, let $W_R$ be the sum of the weights for the respondents and $W_M$ be the sum of the weights for the nonrespondents. Then the new weight for a respondent in a cell will be the previous weight multiplied by the weighting-adjustment factor

$$\frac{W_R + W_M}{W_R}.$$

Thus, the weights that would be assigned to nonrespondents are reallocated among respondents with similar (we hope) characteristics.

A problem occurs if $(W_R + W_M)/W_R$ is too large. If it exceeds 2, the nonrespondents in the cell carry more total weight than the respondents. In this case, the variance of the estimate increases; if the number of respondents in the cell is small, the weight may not be stable. The U.S. Census Bureau collapses cells to obtain weighting-adjustment factors of 2 or less. If there are fewer than 30 interviewed persons in a cell or if the weighting-adjustment factor is greater than 2, the cell is combined (collapsed) with neighboring cells until the collapsed cell has more than 30 observations and a weighting-adjustment factor of 2 or less.

Construction of Weighting Classes

Weighting-adjustment classes should be constructed as though they were strata; as shown in the next section, weighting adjustment is similar to poststratification.
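The weighting-class calculations of Example 12.4 and the collapsing rule quoted in Example 12.5 can be sketched in a few lines. The figures are those of Table 12.2; the function names and the idea of flagging (rather than actually merging) cells are assumptions made for illustration.

```python
# Table 12.2: age class -> (respondents, sum of weights for sample,
#                           sum of weights for respondents)
table_12_2 = {
    "15-24": (124, 30322, 18693),
    "25-34": (187, 33013, 28143),
    "35-44": (162, 27046, 24371),
    "45-64": (187, 29272, 28138),
    "65+":   (203, 30451, 30451),
}


def weight_factor(weights_sample, weights_respondents):
    """Weighting-class adjustment factor 1 / phi_hat_c."""
    phi_hat = weights_respondents / weights_sample
    return 1.0 / phi_hat


def needs_collapsing(n_respondents, factor, min_respondents=30, max_factor=2.0):
    """Flag a cell using the thresholds quoted in Example 12.5."""
    return n_respondents < min_respondents or factor > max_factor


factors = {}
for age, (n_resp, w_sample, w_resp) in table_12_2.items():
    factors[age] = weight_factor(w_sample, w_resp)
    flag = needs_collapsing(n_resp, factors[age])
    print(f"{age:6s} phi_hat={w_resp / w_sample:.4f} "
          f"factor={factors[age]:.3f} collapse={flag}")

# After adjustment, respondents' weights in each class add up to the class's
# full sample weight, so the adjusted weights sum to the overall total.
adjusted_total = sum(w_resp * factors[age]
                     for age, (_, _, w_resp) in table_12_2.items())
print(round(adjusted_total))
```

None of the five age classes in Table 12.2 would be flagged under these thresholds; a cell with, say, 20 respondents or a factor of 2.5 would be.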
The classes should be formed so that units within each class are as similar as possible with respect to the major variables of interest and so that the response rates vary from class to class. Little (1986) suggests estimating the response probabilities as a function of the known variables (perhaps using logistic regression) and grouping observations into classes based on the estimated response probabilities $\hat{\phi}_i$. This approach is preferable to simply using the estimated values of $\phi_i$ in individual case weights, as the estimated response probabilities may be extremely variable and might cause the final estimates to be unstable.

12.5.2 Post-stratification
Post-stratification is similar to weighting-class adjustment, except that population counts are used to adjust the weights. Suppose an SRS is taken. After the sample is collected, units are grouped into H different post-strata, usually based on demographic variables such as race or sex. The population has $N_h$ units in post-stratum h; of these, $n_h$ were selected for the sample and $n_{hR}$ responded. Let $\bar{y}_{hR}$ be the mean of the respondents in post-stratum h. The post-stratified estimator for the population mean $\bar{y}_U$ is

$$\bar{y}_{post} = \sum_{h=1}^{H} \frac{N_h}{N} \bar{y}_{hR};$$

the weighting-class estimator for $\bar{y}_U$, if the weighting classes are the post-strata, is

$$\bar{y}_{wc} = \sum_{h=1}^{H} \frac{n_h}{n} \bar{y}_{hR}.$$

The two estimators are similar in form; the only difference is that in post-stratification the $N_h$ are known, whereas in weighting-class adjustments the $N_h$ are unknown and estimated by $N n_h / n$.

For the post-stratified estimator, often the conditional variance given the $n_{hR}$ is used. For an SRS,

$$V(\bar{y}_{post} \mid n_{hR}, h = 1, \ldots, H) = \sum_{h=1}^{H} \left(\frac{N_h}{N}\right)^2 \left(1 - \frac{n_{hR}}{N_h}\right) \frac{S_h^2}{n_{hR}}. \qquad (12.3)$$

The unconditional variance of $\bar{y}_{post}$ is slightly larger, with additional terms of order $1/n_{hR}^2$, as given in Oh and Scheuren (1983). A variance estimator for post-stratification will be given in Exercise 5 of Chapter 9.

12.5.2.1. Post-stratification Using Weights
In a general survey design, the sum of the weights in subgroup h is supposed to estimate the population count $N_h$ for that subgroup. Post-stratification uses the ratio estimator within each subgroup to adjust by the true population count.
Let $x_{hi} = 1$ if unit i is a respondent in post-stratum h, and 0 otherwise. Then, let

$$w_i^* = \sum_{h=1}^{H} w_i x_{hi} \frac{N_h}{\sum_{j \in S} w_j x_{hj}}.$$

Using the modified weights, $\sum_{i \in S} w_i^* x_{hi} = N_h$, and the post-stratified estimator of the population total is

$$\hat{t}_{post} = \sum_{i \in S} w_i^* y_i.$$

Post-stratification can adjust for undercoverage as well as nonresponse if the population count $N_h$ includes individuals not in the sampling frame for the survey.

Example 12.6
The second stage factor in the NCVS (see Section 7.6) uses post-stratification to adjust the weights. After all other weighting adjustments have been done, including the weighting-class adjustments for nonresponse, post-stratification is used to make the sample counts agree with estimates of the population counts from the U.S. Census Bureau. Each person is assigned to one of 72 post-strata based on the person’s age, race, and sex. The number of persons in the population falling in that post-stratum, $N_h$, is known from other sources. Then, the weight for a person in post-stratum h is multiplied by

$$\frac{N_h}{\sum_{j \in S} w_j x_{hj}}.$$

With weighting classes, the weighting factor to adjust for unit nonresponse is always at least 1. With post-stratification, because weights are adjusted so that they sum to a known population total, the weighting factor can be any positive number, although weighting factors of 2 or less are desirable.

Post-stratification assumes that: (1) within each post-stratum, each unit selected to be in the sample has the same probability of being a respondent, (2) the response or nonresponse of a unit is independent of the behavior of all other units, and (3) nonrespondents in a post-stratum are like the respondents. In other words, the data are MCAR within each post-stratum. These are big assumptions; to make them seem a little more plausible, survey researchers often use many post-strata. But a large number of post-strata may create additional problems, in that few respondents in some post-strata may result in unstable estimates, and may preclude the application of the central limit theorem.
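As a rough illustration of post-stratification using weights, the sketch below rescales each respondent's weight by the known population count of the post-stratum divided by the sum of respondent weights in that post-stratum. The function name `poststratify` and all of the numbers are hypothetical, chosen only so that the arithmetic is easy to follow.

```python
from collections import defaultdict


def poststratify(weights, strata, pop_counts):
    """Rescale respondent weights so they sum to the known count in each post-stratum.

    weights    -- current weights of the respondents
    strata     -- post-stratum label of each respondent (same order as weights)
    pop_counts -- dict mapping post-stratum label to the known population count
    """
    weight_sums = defaultdict(float)
    for w, h in zip(weights, strata):
        weight_sums[h] += w
    # Each weight is multiplied by (known count) / (sum of weights in the post-stratum).
    return [w * pop_counts[h] / weight_sums[h] for w, h in zip(weights, strata)]


# Hypothetical respondents: weight, post-stratum, and a 0/1 response y.
weights = [100.0, 120.0, 80.0, 150.0, 150.0]
strata = ["F", "F", "F", "M", "M"]
y = [1, 0, 1, 1, 0]
pop_counts = {"F": 330, "M": 270}  # assumed known from an external source

new_weights = poststratify(weights, strata, pop_counts)
t_post = sum(w * yi for w, yi in zip(new_weights, y))
print(new_weights)
print(t_post)
```

In this made-up example the female weights are multiplied by a factor above 1 and the male weights by a factor below 1, which illustrates the point that a post-stratification factor, unlike a weighting-class factor, can be any positive number.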
If faced with post-strata with few observations, most practitioners collapse the post-strata with others that have similar means in key variables until they have a reasonable number of observations in each post-stratum. For the Current Population Survey, a “reasonable” number means that each group has at least 20 observations and that the response rate for each group is at least 50%.

12.5.2.2. Raking Adjustments
Raking is a post-stratification method that can be used when post-strata are formed using more than one variable, but only the marginal population totals are known. Raking was first used in the 1940 census to ensure that the complete census data and samples taken from it gave consistent results and was introduced in Deming and Stephan (1940); Brackstone and Rao (1976) further developed the theory. Oh and Scheuren (1983) describe raking ratio estimates for nonresponse.

Consider the following table of sums of weights from a sample; each entry in the table is the sum of the sampling weights for persons in the sample falling in that classification (for example, the sum of the sampling weights for black females is 300).

                   Black    White    Asian    Native American    Other    Sum of Weights
Female               300     1200       60                 30       30              1620
Male                 150     1080       90                 30       30              1380
Sum of Weights       450     2280      150                 60       60              3000

Now suppose we know the true population counts for the marginal totals: we know that the population has 1510 women and 1490 men, 600 blacks, 2120 whites, 150 Asians, 100 Native Americans, and 30 persons in the “Other” category. The population counts for each cell in the table, however, are unknown; we do not know the number of black females in this population and cannot assume independence. Raking allows us to adjust the weights so that the sums of weights in the margins equal the population counts.

First, adjust the rows. Multiply each entry by (true row population) / (estimated row population).
Multiplying the cells in the female row by 1510/1620 and the cells in the male row by 1490/1380 results in the following table:

                   Black      White     Asian    Native American    Other    Sum of Weights
Female            279.63    1118.52     55.93              27.96    27.96              1510
Male              161.96    1166.09     97.17              32.39    32.39              1490
Sum of Weights    441.59    2284.61    153.10              60.35    60.35              3000

The row totals are fine now, but the column totals do not yet equal the population totals. Repeat the same procedure with the columns in the new table. The entries in the first column are each multiplied by 600/441.59, and the other columns are adjusted in the same way. The following table results:

                   Black      White     Asian    Native American    Other    Sum of Weights
Female            379.94    1037.93     54.79              46.33    13.90           1532.90
Male              220.06    1082.07     95.21              53.67    16.10           1467.10
Sum of Weights       600       2120       150                100       30              3000

But this has thrown the row totals off again. Repeat the procedure until both row and column totals equal the population counts. The procedure converges as long as all cell counts are positive. In this example, the final table of adjusted counts is

                   Black      White     Asian    Native American    Other    Sum of Weights
Female            375.59    1021.47     53.72              45.56    13.67              1510
Male              224.41    1098.53     96.28              54.44    16.33              1490
Sum of Weights       600       2120       150                100       30              3000

The entries in the last table may be better estimates of the cell populations (that is, with smaller variance) than the original weighted estimates, simply because they use more information about the population. The weighting-adjustment factor for each white male in the sample is 1098.53/1080; the weight of each white male is increased a little to adjust for nonresponse and undercoverage. Likewise, the weights of white females are decreased because they are overrepresented in the sample.

The assumptions for raking are the same as for post-stratification, with the additional assumption that the response probabilities depend only on the row and column and not on the particular cell. If the sample sizes in each cell are large enough, the raking estimator is approximately unbiased.
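The alternating row and column adjustments of the raking example can be sketched as a short loop. The function name `rake`, the convergence tolerance, and the iteration cap are assumptions made for this illustration; the starting table and the marginal population counts are those of the example above.

```python
def rake(table, row_targets, col_targets, tol=1e-8, max_iter=1000):
    """Alternately rescale rows and columns until both sets of margins match.

    Assumes every cell is positive and that the row and column targets
    have the same grand total.
    """
    if abs(sum(row_targets) - sum(col_targets)) > tol:
        raise ValueError("row and column targets must have the same total")
    cells = [row[:] for row in table]  # work on a copy
    for _ in range(max_iter):
        # Row step: multiply each row by (true row total) / (current row total).
        for i, target in enumerate(row_targets):
            total = sum(cells[i])
            cells[i] = [value * target / total for value in cells[i]]
        # Column step: multiply each column by (true column total) / (current total).
        for j, target in enumerate(col_targets):
            total = sum(row[j] for row in cells)
            for row in cells:
                row[j] *= target / total
        # Columns are now exact; stop once the rows also agree.
        if all(abs(sum(cells[i]) - t) < tol for i, t in enumerate(row_targets)):
            return cells
    raise RuntimeError("raking did not converge")


# Sums of weights: rows are Female, Male; columns are
# Black, White, Asian, Native American, Other.
weights = [[300, 1200, 60, 30, 30],
           [150, 1080, 90, 30, 30]]
raked = rake(weights, row_targets=[1510, 1490],
             col_targets=[600, 2120, 150, 100, 30])
for label, row in zip(["Female", "Male"], raked):
    print(label, [round(value, 2) for value in row])
```

Dividing a raked cell by the corresponding original cell gives the weighting-adjustment factor for persons in that cell, as in the white-male calculation in the text.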
Raking has some difficulties: the algorithm may not converge if some of the cell estimates are zero. There is also a danger of “overadjustment”: if there is little relation between the extra dimension in raking and the cell means, raking can increase the variance rather than decrease it.

12.5.3 Estimating the Probability of Response: Other Methods
Some weighting-class methods use weights that are the reciprocal of the estimated probability of response. A famous example is the Politz-Simmons method for adjusting for nonavailability of sample members. Suppose all calls are made during Monday through Friday evenings. Each respondent is asked whether he or she was at home, at the time of the interview, on each of the four preceding weeknights. The respondent replies that she was home k of the four nights. It is then assumed that the probability of response is proportional to the number of nights at home during interviewing hours, so the probability of response for respondent i is estimated by

$$\hat{\phi}_i = \frac{k_i + 1}{5}.$$

The sampling weight $w_i$ for each respondent is then multiplied by $5/(k_i + 1)$. The respondents with k = 0 were home on only one of the five nights and are assigned to represent their share of the population plus the share of four persons in the sample who were called on one of their “unavailable” nights. The respondents most likely to be home have k = 4; it is presumed that all persons in the sample who were home every night were reached, so their weights are unchanged. The estimate of the population mean is

$$\bar{y} = \frac{\sum_{i \in R} 5 w_i y_i / (k_i + 1)}{\sum_{i \in R} 5 w_i / (k_i + 1)},$$

where R denotes the set of respondents.

This method of weighting, described by Hartley (1946) and Politz and Simmons (1949), is based on the premise that the most accessible persons will tend to be overrepresented in the survey data. The method is easy to use, theoretically appealing, and can be used in conjunction with callbacks. But it still misses people who were not at home on any of the five nights or who refused to participate in the survey.
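A minimal sketch of the Politz-Simmons adjustment follows, assuming each respondent record carries a sampling weight, the number k of the four preceding weeknights at home, and the response y. The function name and the three respondents are hypothetical.

```python
def politz_simmons_mean(respondents):
    """Weighted mean with each sampling weight multiplied by 5 / (k + 1).

    respondents -- iterable of (w, k, y) tuples, where w is the sampling weight,
                   k the number of the four preceding weeknights at home (0 to 4),
                   and y the response.
    """
    numerator = 0.0
    denominator = 0.0
    for w, k, y in respondents:
        if not 0 <= k <= 4:
            raise ValueError("k must be between 0 and 4")
        adjusted = w * 5.0 / (k + 1)  # reciprocal of the estimated response probability
        numerator += adjusted * y
        denominator += adjusted
    return numerator / denominator


# Hypothetical data: one rarely-home respondent with y = 1 and two
# always-home respondents with y = 0, all with sampling weight 10.
sample = [(10.0, 0, 1), (10.0, 4, 0), (10.0, 4, 0)]
print(politz_simmons_mean(sample))
```

In this made-up sample the unadjusted mean would be 1/3, but the respondent who was home on only one night in five counts five times as heavily, which pulls the estimate toward that person's value.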
Because nonresponse is due largely to refusals in some telephone surveys, the Politz-Simmons method may not be helpful in dealing with all nonresponse. Values of k may also be in error, because people may err when recalling how many evenings they were home. Potthoff et al. (1993) modified and extended the Politz-Simmons method to determine weights based on the number of callbacks needed, assuming that the $\phi_i$’s follow a beta distribution.

12.5.4. A Caution About Weights
The models for weighting adjustments for nonresponse are strong: in each weighting cell, the respondents and nonrespondents are assumed to be similar. Each individual in a weighting class is assumed equally likely to respond to the survey, regardless of the value of the response. These models never exactly describe the true state of affairs, and you should always consider their plausibility and implications. It is an unfortunate tendency of many survey practitioners to treat the weighting adjustment as a complete remedy and to then act as though there was no nonresponse. Weights may improve many of the estimates, but they rarely eliminate all nonresponse bias. If weighting adjustments are made (and remember, making no adjustments is itself a model about the nature of the nonresponse), practitioners should always state the assumed response model and give evidence to justify it.

Weighting adjustments are usually used for unit nonresponse, not for item nonresponse (which would require a different weight for each item).

12.6 Imputation
Missing items may occur in surveys for several reasons: an interviewer may fail to ask a question; a respondent may refuse to answer the question or cannot provide the information; a clerk entering the data may skip the value. Sometimes, items with responses are changed to missing when the data set is edited or cleaned: a data editor may not be able to resolve the discrepancy for an individual 3-year-old who voted in the last election and may set both values to missing.
Imputation is commonly used to assign values to the missing items. A replacement value, often from another person in the survey who is similar to the item nonrespondent on other variables, is imputed for the missing value. When imputation is used, an additional variable that indicates whether the response was measured or imputed should be created for the data set.

Imputation procedures are used not only to reduce the nonresponse bias but to produce a “clean,” rectangular data set, one without holes for the missing values. We may want to look at tables for subgroups of the population, and imputation allows us to do that without considering the item nonresponse separately each time we construct a table. Some references for imputation include Sande (1983) and Kalton and Kasprzyk (1982; 1986).

Example 12.7
The Current Population Survey (CPS) has an overall high household response rate (typically well above 90%), but some households refuse to answer certain questions. The nonresponse rate is about 20% on many income questions. This nonresponse would create a substantial bias in any analysis unless some corrective action were taken: various studies suggest that the item nonresponse for the income items is highest for low-income and high-income households. Imputation for the missing data makes it possible to use standard statistical techniques such as regression without the analyst having to treat the nonresponse by using specially developed methods. For surveys such as the CPS, if imputation is to be done, the agency collecting the data has more information to guide it in filling the missing values than does an independent analyst, because identifying information is not released on the public-use tapes.

The CPS uses weighting for noninterview adjustment and hot-deck imputation for item nonresponse. The sample is divided into classes using variables sex, age, race, and other demographic characteristics.
If an item is missing, a corresponding item from another unit in that class is substituted. Usually, hot-deck imputation is done by taking the value of the missing item from a household that is similar to the household with the missing item in some other explanatory variable such as family size.

We use the small data set in Table 12.3 to illustrate some of the different methods for imputation. This artificial data set is only used for illustration; in practice, a much larger data set is needed for imputation. A “1” means the respondent answered yes to the question, and a “?” marks a missing value.

Table 12.3 Small Data Set Used to Illustrate Imputation Methods
[The Age and Sex columns of this table are not legible in the transcription; the remaining columns are shown.]

Person    Years of Education    Crime Victim?    Violent-Crime Victim?
   1              16                  0                    0
   2               ?                  1                    1
   3              11                  0                    0
   4               ?                  1                    1
   5              12                  1                    1
   6               ?                  0                    0
   7              20                  1                    ?
   8              12                  0                    0
   9              13                  0                    ?
  10              10                  ?                    ?
  11              12                  0                    0
  12              12                  0                    0
  13              11                  1                    ?
  14              16                  1                    0
  15              14                  0                    0
  16              11                  0                    0
  17              14                  0                    0
  18              10                  0                    0
  19              12                  ?                    0
  20              10                  0                    0

12.6.1. Deductive Imputation
Some values may be imputed in the data editing, using logical relations among the variables. In Table 12.3, person 9 is missing the response for whether she was a victim of violent crime. But she had responded that she was not a victim of any crime, so the violent-crime response should be changed to 0.

Deductive imputation may sometimes be used in longitudinal surveys. If a woman has two children in year 1 and two children in year 3, but is missing the value for year 2, the logical value to impute would be 2.

12.6.2. Cell Mean Imputation
Respondents are divided into classes (cells) based on known variables, as in weighting-class adjustments. Then, the average of the values for the responding units in cell c, $\bar{y}_{cR}$, is substituted for each missing value. Cell mean imputation assumes that missing items are missing completely at random within the cells.

Example 12.8
The four cells for our example are constructed using the variables age and sex. (In practice, of course, you would want to have many more individuals in each cell.)
                  Age ≤ 34                        Age ≥ 35
Sex    M          Persons 3, 5, 10, 14            Persons 1, 7, 8, 15, 16
       F          Persons 4, 12, 13, 19, 20       Persons 2, 6, 9, 11, 17, 18

Persons 2 and 6, missing the value for years of education, would be assigned the mean value for the four women aged 35 or older who responded to the question: 12.25. The mean for each cell after imputation is the same as the mean of the respondents. The imputed value, however, is not one of the possible responses to the question about education.

Mean imputation gives the same point estimates for means, totals, and proportions as the weighting-class adjustments. Mean imputation methods fail to reflect the variability of the nonrespondents, however: all missing observations in a class are given the same imputed value. The distribution of y will be distorted because of a “spike” at the value of the sample mean of the respondents. As a consequence, the estimated variance in the subclass will be too small.

To avoid the spike, a stochastic cell mean imputation could be used. If the response variable were approximately normally distributed, the missing values could be imputed with a randomly generated value from a normal distribution with mean $\bar{y}_{cR}$ and standard deviation $s_{cR}$, the sample mean and standard deviation of the respondents in cell c.

Mean imputation, stochastic or otherwise, distorts relationships among different variables because imputation is done separately for each missing item. Sample correlations and other statistics are changed. Jinn and Sedransk (1989a; 1989b) discuss the effect of different imputation methods on secondary data analysis, for instance, for estimating a regression slope.

12.6.3. Hot-Deck Imputation
In hot-deck imputation, as in cell mean imputation and weighting-adjustment methods, the sample units are divided into classes. The value of one of the responding units in the class is substituted for each missing response. Often, the values for a set of related missing items are taken from the same donor, to preserve some of the multivariate relationships.
The name hot deck is from the days when computer programs and data sets were punched on cards: the deck of cards containing the data set being analyzed was warmed by the card reader, so the term hot deck was used to refer to imputations made using the same data set. Fellegi and Holt (1976) discuss methods for data editing and hot-deck imputation with large surveys.

How is the donor unit to be chosen? Several methods are possible.

Sequential Hot-Deck Imputation
Some hot-deck imputation procedures impute the value in the same subgroup that was last read by the computer. This is partly a carryover from the card days of computers (imputation could be done in one pass) and partly a belief that, if the data are arranged in some geographic order, adjacent units in the same subgroup will tend to be more similar than randomly chosen units in the subgroup. One problem with using the value on the previous “card” is that often nonrespondents also tend to occur in clusters, so one person may be a donor multiple times, in a way that the sampler cannot control. One of the other hot-deck imputation methods is usually used today for most surveys.

In our example, person 19 is missing the response for crime victimization. Person 13 had the last response recorded in her subclass, so the value 1 is imputed.

Random Hot-Deck Imputation
A donor is randomly chosen from the persons in the cell with information on all missing items. To preserve multivariate relationships, usually values from the same donor are used for all missing items of a person.

In our small data set, person 10 is missing both variables for victimization. Persons 3, 5, and 14 in his cell have responses for both crime questions, so one of the three is chosen randomly as the donor. In this case, person 14 is chosen, and his values are imputed for both missing variables.
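Random hot-deck imputation within a cell can be sketched as follows. The record layout, the function name `random_hot_deck`, and the seeded random generator are assumptions made for this illustration; the four records are the members of person 10's cell, with the victimization values as read from Table 12.3 (`None` marks a missing item). The sketch also stores the donor used for each recipient, in line with the advice to flag imputed values.

```python
import random


def random_hot_deck(records, items, cell_key, rng):
    """Fill missing items from one randomly chosen complete donor in the same cell.

    All missing items of a recipient come from the same donor, which helps
    preserve relationships among the items. The donor's id is stored so that
    imputed values can be told apart from measured ones.
    A cell with no complete donor raises KeyError in this simple sketch.
    """
    donors_by_cell = {}
    for rec in records:
        if all(rec[item] is not None for item in items):
            donors_by_cell.setdefault(rec[cell_key], []).append(rec)

    completed = []
    for rec in records:
        new = dict(rec, donor=None)
        missing = [item for item in items if rec[item] is None]
        if missing:
            donor = rng.choice(donors_by_cell[rec[cell_key]])
            for item in missing:
                new[item] = donor[item]
            new["donor"] = donor["person"]
        completed.append(new)
    return completed


# Members of person 10's cell (the cell label is hypothetical).
cell = [
    {"person": 3, "cell": "M-young", "crime": 0, "violent": 0},
    {"person": 5, "cell": "M-young", "crime": 1, "violent": 1},
    {"person": 10, "cell": "M-young", "crime": None, "violent": None},
    {"person": 14, "cell": "M-young", "crime": 1, "violent": 0},
]
result = random_hot_deck(cell, ["crime", "violent"], "cell", random.Random(0))
for rec in result:
    print(rec)
```

Which donor is drawn depends on the random seed, so the imputed pair for person 10 may be that of person 3, 5, or 14.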
Nearest-Neighbor Hot-Deck Imputation
Define a distance measure between observations, and impute the value of a respondent who is “closest” to the person with the missing item, where closeness is defined using the distance function.

If age and sex are used for the distance function, so that the person of closest age with the same sex is selected to be the donor, the victimization responses of person 3 will be imputed for person 10.

12.6.4. Regression Imputation
Regression imputation predicts the missing value by using a regression of the item of interest on variables observed for all cases. A variation is stochastic regression imputation, in which the missing value is replaced by the predicted value from the regression model, plus a randomly generated error term.

We only have 18 complete observations for the response crime victimization (not really enough for fitting a model to our data set), but a logistic regression of the response with explanatory variable age gives a fitted model for the predicted probability of victimization, $\hat{p}$, of the form

$$\log\left(\frac{\hat{p}}{1 - \hat{p}}\right) = \hat{\beta}_0 + \hat{\beta}_1 \times \text{age}.$$

The predicted probability of being a crime victim for a 17-year-old is 0.74; because that is greater than a predetermined cutoff of 0.5, the value 1 is imputed for person 10.

Example 12.9
Paulin and Ferraro (1994) discuss regression models for imputing income in the U.S. Consumer Expenditure Survey. Households selected for the interview component of the survey are interviewed each quarter for five consecutive quarters; in each interview, they are asked to recall expenditures for the previous 3 months. The data are used to relate consumer expenditures to characteristics such as family size and income; they are the source of reports that expenditures exceed income in certain income classes. The Consumer Expenditure Survey conducts about 5000 interviews each year, as opposed to about 60,000 for the NCVS.
This sample size is too small for hot-deck imputation methods, as it is less likely that suitable donors will be found for nonrespondents in a smaller sample. If imputation is to be done at all, a parametric model needs to be adopted. Paulin and Ferraro used multiple regression models to predict the log of family income (logarithms are used because the distribution of income is skewed) from explanatory variables including total expenditures and demographic variables. These models assume that income items are MAR, given the covariates.

12.6.5. Cold-Deck Imputation
In cold-deck imputation, the imputed values are from a previous survey or other information, such as from historical data. (Since the data set serving as the source for the imputation is not the one currently running through the computer, the deck is “cold.”) Little theory exists for the method. As with hot-deck imputation, cold-deck imputation is not guaranteed to eliminate selection bias.

12.6.6. Substitution
Substitution methods are similar to cold-deck imputation. Sometimes interviewers are allowed to choose a substitute while in the field; if the household selected for the sample is not at home, they try next door. Substitution may help reduce some nonresponse bias, as the household next door may be more similar to the nonresponding household than would be a household selected at random from the population. But the household next door is still a respondent; if the nonresponse is related to the characteristics of interest, there will still be nonresponse bias. An additional problem is that, since the interviewer is given discretion about which household to choose, the sample no longer has known probabilities of selection.

The 1975 Michigan Survey of Substance Abuse was taken to estimate the number of persons that used 16 types of substances in the previous year. The sample design was a stratified multistage sample with 2100 households.
Three calls were made at a dwelling; then the house to the right was tried, then the house to the left. From the data, evidence shows that the substance-use rate increases as the required number of calls increases.

Some surveys select designated substitutes at the same time the sample units are selected. If a unit does not respond, then one of the designated substitutes is randomly selected. The National Longitudinal Study (see National Center of Educational Statistics 1977) used this method. This stratified, multistage sample of the high school graduating class of 1972 was intended to provide data on the educational experiences, plans, and attitudes of high school seniors. Four high schools were randomly selected from each of 600 strata. Two were designated for the sample, and the other two were saved as backups in case of nonresponse. Of the 1200 schools designated for the sample, 948 participated, 21 had no graduating seniors, and 231 either refused or were unable to participate. Investigators chose 122 schools from the backup group to substitute for the nonresponding schools. Follow-up studies showed a consistent 5% bias in a number of estimated totals, which was attributed to the use of substitute schools and to nonresponse.

Substitution has the added danger that efforts to contact the designated units may not be as great as if no “easy way out” was provided. If substitution is used, it should be reported in the results.

12.6.7. Multiple Imputation
In multiple imputation, each missing value is imputed m (≥ 2) different times. Typically, the same stochastic model is used for each imputation. These create m different “data” sets with no missing values. Each of the m data sets is analyzed as if no imputation had been done; the different results give the analyst a measure of the additional variance due to the imputation. Multiple imputation with different models for nonresponse can give an idea of the sensitivity of the results to particular nonresponse models.
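One common way to turn the m sets of results into a single estimate and variance is the set of combining rules usually attributed to Rubin, which the manual does not spell out; the sketch below is offered under that assumption. The point estimate is the average of the m estimates, and the total variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance. The function name and the numbers are hypothetical.

```python
from statistics import mean, variance


def combine_multiple_imputations(estimates, variances):
    """Combine m completed-data estimates and their estimated variances.

    Returns (combined estimate, total variance), where the total variance is
    the mean within-imputation variance plus (1 + 1/m) times the
    between-imputation variance of the m estimates.
    """
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need m >= 2 estimates, each with a variance")
    combined = mean(estimates)
    within = mean(variances)
    between = variance(estimates)  # sample variance, divisor m - 1
    total = within + (1.0 + 1.0 / m) * between
    return combined, total


# Hypothetical results from m = 3 imputed data sets.
estimate, total_variance = combine_multiple_imputations(
    [10.0, 12.0, 11.0], [1.0, 1.0, 1.0])
print(estimate, total_variance)
```

The between-imputation term is what a single imputation cannot provide: it reflects how much the answer changes when the missing values are filled in differently.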
See Rubin (1987; 1996) for details on implementing multiple imputation.

12.6.8. Advantages and Disadvantages of Imputation
Imputation creates a “clean,” rectangular data set that can be analyzed by standard software. Analyses of different subsets of the data will produce consistent results. If the nonresponse is missing at random given the covariates used in the imputation procedure, imputation substantially reduces the bias due to item nonresponse. If parts of the data are confidential, the data collector can perform the imputation. The data collector has more information about the sample and population than is released to the public (for example, the collector may know the exact address for each sample member) and can often perform a better imputation using that information.

The foremost danger of using imputation is that future data analysis will not distinguish between the original and the imputed values. Ideally, the imputer should record which observations are imputed, how many times each nonimputed record is used as a donor, and which donor was used for a specific response imputed to a recipient. The imputed values may be good guesses, but they are not real data.

Variances computed using the data together with the imputed values are always too small, partly because of the artificial increase in the sample size and partly because the imputed values are treated as though they were really obtained in the data collection. The true variance will be larger than that estimated from a standard software package. Rao (1996) and Fay (1996) discuss methods for estimating the variances after imputation.

12.7 Parametric Models for Nonresponse
Most of the methods for dealing with nonresponse assume that the nonresponse is ignorable, that is, conditionally on measured covariates, nonresponse is independent of the variables of interest.
In this situation, rather than simply dividing units among different subclasses and adjusting weights, one can fit a superpopulation model. From the model, then, one predicts the values of the y’s not in the sample. The model fitting is often iterative.

In a completely model-based approach, we develop a model for the complete data and add components to the model to account for the proposed nonresponse mechanism. Such an approach has many advantages over other methods: the modeling approach is flexible and can be used to include any knowledge about the nonresponse mechanism, the modeler is forced to state the assumptions about nonresponse explicitly in the model, and some of these assumptions can be evaluated. In addition, variance estimates that result from fitting the model account for the nonresponse, if the model is a good one.

Example 12.10
Many people believe that spotted owls in Washington, Oregon, and California are threatened with extinction because timber harvesting in mature coniferous forests reduces their available habitat. Good estimates of the size of the spotted owl population are needed for reasoned debate on the issue.

In the sampling plan described by Azuma et al. (1990), a region of interest is divided into N sampling regions (PSU’s), and an SRS of n PSU’s is selected. Let $Y_i = 1$ if PSU i is occupied, and $Y_i = 0$ otherwise. Assume that the $Y_i$’s are independent and that $P(Y_i = 1) = p$, the true proportion of occupied PSU’s. If occupancy could be definitively determined for each PSU, the proportion of PSU’s occupied could be estimated by the sample proportion $\bar{y}$. While a fixed number of visits can establish that a PSU is occupied, however, a determination that a PSU is unoccupied may be wrong: some owl pairs are “nonrespondents,” and ignoring the nonresponse will likely result in a too-low estimate of percentage occupancy.

Azuma et al. (1990) propose using a geometric distribution for the number of visits required to discover the owls in an occupied unit, thus modeling the nonresponse.
The assumptions for the model are: (1) the probability of determining occupancy on the first visit, $\eta$, is the same for all PSU’s, (2) each visit to a PSU is independent, and (3) visits can continue until an owl is sighted. A geometric distribution is commonly used for number of callbacks needed in surveys of people (see Potthoff et al. 1993).

Let $X_i$ be the number of visits required to determine whether PSU i is occupied or not. Under the geometric model,

$$P(X_i = x \mid Y_i = 1) = \eta (1 - \eta)^{x - 1}, \quad x = 1, 2, 3, \ldots$$

The budget of the U.S. Forest Service, however, does not allow for an infinite number of visits. Suppose a maximum of s visits are to be made to each PSU. The random variable $Y_i$ cannot be observed; the observable random variables are

$$V_i = \begin{cases} 1 & \text{if } Y_i = 1 \text{ and } X_i \le s, \\ 0 & \text{otherwise,} \end{cases} \qquad U_i = \begin{cases} X_i & \text{if } V_i = 1, \\ 0 & \text{otherwise.} \end{cases}$$

Here, $\sum_{i \in S} V_i$ counts the number of PSU’s observed to be occupied, and $\sum_{i \in S} U_i$ counts the total number of visits made to occupied units. Using the geometric model, the probability that an owl is first observed in PSU i on visit k (≤ s) is

$$P(U_i = k) = \eta (1 - \eta)^{k - 1} p,$$

and the probability that an owl is observed on one of the s visits to PSU i is

$$P(V_i = 1) = E[V_i] = [1 - (1 - \eta)^s]\, p.$$

Thus, the expected value of the sample proportion of occupied units, $\bar{v} = \sum_{i \in S} V_i / n$, is $[1 - (1 - \eta)^s]\, p$ and is less than the proportion of interest p if $\eta < 1$. The geometric model agrees with the intuition that owls are missed in the s visits.

We find the maximum likelihood estimates of p and $\eta$ under the assumption that all PSU’s are independent. The likelihood function

$$L(p, \eta) = \prod_{i \in S} \left[\eta (1 - \eta)^{u_i - 1} p\right]^{v_i} \left[1 - p\{1 - (1 - \eta)^s\}\right]^{1 - v_i}$$

is maximized when

$$\hat{p} = \frac{\bar{v}}{1 - (1 - \hat{\eta})^s}$$

and when $\hat{\eta}$ solves

$$\frac{\bar{u}}{\bar{v}} = \frac{1}{\eta} - \frac{s (1 - \eta)^s}{1 - (1 - \eta)^s};$$

numerical methods are needed to calculate $\hat{\eta}$. Maximum likelihood theory also allows calculation of the asymptotic covariance matrix of the parameter estimates.

An SRS of 240 habitat PSU’s in California had the following results:

Visit Number                  1     2     3     4     5     6
Number of occupied PSU’s     33    17    12     7     7     5

A total of 81 PSU’s were observed to be occupied in six visits, so $\bar{v} = 81/240 = 0.3375$. The average number of visits made to occupied units was $\bar{u}/\bar{v} = 196/81 = 2.42$. Thus, the maximum likelihood estimates are $\hat{\eta} = 0.334$ and $\hat{p} = 0.370$; using the asymptotic covariance matrix from maximum likelihood theory, we estimate the variance of $\hat{p}$ by 0.00137.
Thus, an approximate 95% confidence interval for the proportion of units that are occupied is 0.370 ± 0.072.

Incorporating the geometric model for number of visits gave a larger estimate of the proportion of units occupied. If the model does not describe the data, however, the estimate $\hat{p}$ will still be biased; if the model is poor, $\hat{p}$ may be a worse estimate of the occupancy rate than $\bar{v}$. If, for example, field investigators were more likely to find owls on later visits because they accumulate additional information on where to look, the geometric model would be inappropriate.

We need to check whether the geometric model adequately describes the number of visits needed to determine occupancy. Unfortunately, we cannot determine whether the model would describe the situation for units in which owls are not detected in six visits, as the data are missing. We can, however, use a goodness-of-fit test to see whether data from the six visits made are fit by the model. Under the model, we expect $n \eta (1 - \eta)^{k - 1} p$ of the PSU’s to have owls observed on visit k, and we plug in our estimates of p and $\eta$ to calculate expected counts:

Visit     Observed count     Expected count
1               33                29.66
2               17                19.74
3               12                13.14
4                7                 8.75
5-6             12                 9.71
Total           81                80.99

Visits 5 and 6 were combined into one category so that the expected cell count would be greater than 5. The test statistic is 1.75, with p-value > 0.05. There is no indication that the model is inadequate for the data we have. We cannot check its adequacy for the missing data, however. The geometric model assumes observations are independent and that an occupied PSU would eventually be determined to be occupied if enough visits were made. We cannot check whether that assumption of the model is reasonable or not: if some wily owls will never be detected in any number of visits, $\hat{p}$ will still be too small.
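The owl calculations can be reproduced numerically. The sketch below assumes the truncated geometric model of this example: for a maximum of s visits, the mean number of visits to units observed to be occupied is 1/η − s(1 − η)^s / [1 − (1 − η)^s], the detection probability η is found by matching that mean to the observed average, and the occupancy estimate is the observed proportion divided by 1 − (1 − η)^s. Bisection is used here only as one simple root-finding choice, and the function names are hypothetical.

```python
observed = [33, 17, 12, 7, 7, 5]  # PSU's first found occupied on visits 1..6
n_psu = 240                       # PSU's in the SRS
s = len(observed)                 # maximum number of visits


def truncated_geometric_mean(eta, s):
    """Mean number of visits until detection, given detection within s visits."""
    q = (1.0 - eta) ** s
    return 1.0 / eta - s * q / (1.0 - q)


def solve_eta(mean_visits, s, lo=1e-6, hi=1.0 - 1e-6, iterations=100):
    """Bisection: the truncated mean decreases as eta increases."""
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if truncated_geometric_mean(mid, s) > mean_visits:
            lo = mid   # eta too small, mean too large
        else:
            hi = mid
    return 0.5 * (lo + hi)


n_occupied = sum(observed)
total_visits = sum(k * count for k, count in enumerate(observed, start=1))
v_bar = n_occupied / n_psu

eta_hat = solve_eta(total_visits / n_occupied, s)
p_hat = v_bar / (1.0 - (1.0 - eta_hat) ** s)

# Expected number of PSU's first found occupied on each visit under the fitted model.
expected = [n_psu * p_hat * eta_hat * (1.0 - eta_hat) ** (k - 1)
            for k in range(1, s + 1)]

# Goodness-of-fit statistic with visits 5 and 6 combined, as in the text.
obs_grouped = observed[:4] + [sum(observed[4:])]
exp_grouped = expected[:4] + [sum(expected[4:])]
chi_square = sum((o - e) ** 2 / e for o, e in zip(obs_grouped, exp_grouped))

print(f"eta_hat = {eta_hat:.3f}, p_hat = {p_hat:.3f}, chi-square = {chi_square:.2f}")
```

The variance estimate 0.00137 comes from the asymptotic covariance matrix and is not recomputed in this sketch.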
To use models with nonresponse, you need (1) a thorough knowledge of mathematical statistics, (2) a powerful computer, and (3) knowledge of numerical methods for optimization. Commonly, maximum likelihood methods are used to estimate parameters, and the likelihood equations rarely have closed-form solutions. Calculation of estimates required numerical methods even for the simple model adopted for the owls, and that was an SRS with a simple geometric model for the response mechanism that allowed us to easily write down the likelihood function. Likelihood functions for more complex sampling designs or nonresponse mechanisms are much more difficult to construct (particularly if observations in the same cluster are considered dependent), and calculating estimates often requires intensive computations. Little and Rubin (1987) discuss likelihood-based methods for missing data in general. Stasny (1991) gives an example of using models to account for nonresponse.

12.8 What is an Acceptable Response Rate?

Often an investigator will say, “I expect to get a 60% response rate in my survey. Is that acceptable and will the survey give me valid results?” As we have seen in this chapter, the answer to that question depends on the nature of the nonresponse: if the nonrespondents are MCAR, then we can largely ignore the nonresponse and use the respondents as a representative sample of the population. If the nonrespondents tend to differ from the respondents, then the biases in the results from using only the respondents may make the entire survey worthless.

Many references give advice on cutoffs for acceptability of response rates. Babbie, for example, says: “I feel that a response rate of at least 50 percent is adequate for analysis and reporting. A response of at least 60 percent is good. And a response rate of 70 percent is very good” (1973, 165).
I believe that giving such absolute guidelines for acceptable response rates is dangerous and has led many survey investigators to unfounded complacency about nonresponse; many examples exist of surveys with a 70% response rate whose results are flawed. The NCVS needs corrections for nonresponse bias even with a response rate of about 95%.

Be aware that response rates can be manipulated by defining them differently. Researchers often do not say how the response rate was calculated or may use an estimate of response rate that is smaller than it should be. Many surveys inflate the response rate by eliminating units that could not be located from the denominator. Very different results for response rate accrue, depending on which definition of response rate is used; all of the following have been used in surveys:

$$\frac{\text{number of completed interviews}}{\text{number of units in sample}}$$

$$\frac{\text{number of completed interviews}}{\text{number of units contacted}}$$

$$\frac{\text{completed interviews} + \text{ineligible units}}{\text{contacted units}}$$

$$\frac{\text{completed interviews}}{\text{contacted units} - \text{ineligible units}}$$

$$\frac{\text{completed interviews}}{\text{contacted units} - \text{ineligible units} - \text{refusals}}$$

Note that a “response rate” calculated using the last formula will be much higher than one calculated using the first formula because the denominator is smaller.

The guidelines for reporting response rates in Statistics Canada (1993) and Hidiroglou et al. (1993) provide a sensible solution for reporting response rates.
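To see how far the five definitions listed above can diverge, the following minimal Python sketch applies each of them to one set of counts. The counts are invented purely for illustration and do not come from any survey discussed in this manual; the dictionary keys are likewise hypothetical labels.

```python
# Hypothetical counts for a single survey (assumed values, for illustration only).
sampled = 1000      # units in the sample
contacted = 800     # units that could be contacted
ineligible = 50     # contacted units found to be ineligible
refusals = 150      # contacted, eligible units that refused
completed = 500     # completed interviews

# The five "response rates" listed in the text, in the same order.
rates = {
    "completed / sampled": completed / sampled,
    "completed / contacted": completed / contacted,
    "(completed + ineligible) / contacted": (completed + ineligible) / contacted,
    "completed / (contacted - ineligible)": completed / (contacted - ineligible),
    "completed / (contacted - ineligible - refusals)":
        completed / (contacted - ineligible - refusals),
}

for label, value in rates.items():
    print(f"{label:50s} {value:.3f}")
```

With these made-up counts, the same fieldwork yields a reported rate anywhere from 50% to about 83%, which is why the definition and the underlying counts should always accompany a quoted response rate.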
The Statistics Canada (1993) and Hidiroglou et al. (1993) guidelines define in-scope units as those that belong to the target population, and resolved units as those units for which it is known whether or not they belong to the target population [3]. They suggest reporting a number of different response rates for a survey, including the following:

• Out-of-scope rate: the ratio of the number of out-of-scope units to the number of resolved units
• No-contact rate: the ratio of the number of no-contacts and unresolved units to the number of in-scope and unresolved units
• Refusal rate: the ratio of the number of refusals to the number of in-scope units
• Nonresponse rate: the ratio of the number of nonrespondent and unresolved units to the number of in-scope and unresolved units

Different measures of response rates may be appropriate for different surveys, and I hesitate to recommend one “fits-all” definition of response rate. The quantities used in calculating response rate, however, should be defined for every survey. The following recommendations from the U.S. Office of Management and Budget’s Federal Committee on Statistical Methodology, reported in Gonzalez et al. (1994), are helpful:

Recommendation 1. Survey staffs should compute response rates in a uniform fashion over time and document response rate components on each edition of a survey.

Recommendation 2. Survey staffs for repeated surveys should monitor response rate components (such as refusals, not-at-homes, out-of-scopes, address not locatable, post-master returns, etc.) over time, in conjunction with routine documentation of cost and design changes.

Recommendation 3. Response rate components should be published in survey reports; readers should be given definitions of the response rates used, including actual counts, and commentary on the relevance of response rates to the quality of the survey data.

Recommendation 4. Some research on nonresponse can have real payoffs. It should be encouraged by survey administrators as a way to improve the effectiveness of data collection operations.

[3] If, for example, the target population is residential telephone numbers, it may be impossible to tell whether or not a telephone that rings but is not answered belongs to the target population; such a number would be an unresolved unit.

Annex 1

Many surveys have more than one of these problems. The Literary Digest (1932, 1936a, b, c) began taking polls to forecast the outcome of the U.S. presidential election in 1912, and their polls attained a reputation for accuracy because they forecast the correct winner in every election between 1912 and 1932. In 1932, for example, the poll predicted that Roosevelt would receive 56% of the popular vote and 474 votes in the electoral college; in the actual election, Roosevelt received 58% of the popular vote and 472 votes in the electoral college. With such a strong record of accuracy, it is not surprising that the editors of the Literary Digest had a great deal of confidence in their polling methods by 1936. Launching the 1936 poll, they said:

The Poll represents thirty years’ constant evolution and perfection. Based on the “commercial sampling” methods used for more than a century by publishing houses to push book sales, the present mailing list is drawn from every telephone book in the United States, from the rosters of clubs and associations, from city directories, lists of registered voters, classified mail-order and occupational data. (1936a, 3)

On October 31, the poll predicted that Republican Alf Landon would receive 55% of the popular vote, compared with 41% for President Roosevelt. The article “Landon, 1,293,669; Roosevelt, 972,897: Final Returns in The Digest’s Poll of Ten Million Voters” contained the statement: “We make no claim to infallibility.
We did not coin the phrase ‘uncanny accuracy’ which has been so freely applied to our Polls” (1936b). It is a good thing they made no claim to infallibility; in the election, Roosevelt received 61% of the vote; Landon, 37%.

What went wrong? One problem may have been the undercoverage in the sampling frame, which relied heavily on telephone directories and automobile registration lists; the frame was used for advertising purposes, as well as for the poll. Households with a telephone or automobile in 1936 were generally more affluent than other households, and opinion of Roosevelt’s economic policies was generally related to the economic class of the respondent. But sampling frame bias does not explain all the discrepancy. Postmortem analyses of the poll by Squire (1988) and Cahalan (1989) indicate that even persons with both a car and a telephone tended to favor Roosevelt, though not to the degree that persons with neither car nor telephone supported him.

The low response rate to the survey was likely the source of much of the error. Ten million questionnaires were mailed out, and 2.3 million were returned: an enormous sample but a response rate of less than 25%. In Allentown, Pennsylvania, for example, the survey was mailed to every registered voter, but the survey results for Allentown were still incorrect because only one-third of the ballots were returned. Squire (1988) reports that persons supporting Landon were much more likely to have returned the survey; in fact, many Roosevelt supporters did not even remember receiving a survey, even though they were on the mailing list.

One lesson to be learned from the Literary Digest poll is that the sheer size of a sample is no guarantee of its accuracy. The Digest editors became complacent because they sent out questionnaires to more than one quarter of all registered voters and obtained a huge sample of 2.3 million people. But large unrepresentative samples can perform as badly as small unrepresentative samples.
A large unrepresentative sample may do more damage than a small one because many people think that large samples are always better than small ones. The design of the survey is far more important than the absolute size of the sample.

What good are samples with selection bias? We prefer to have samples with no selection bias, that serve as a microcosm of the population. When the primary interest is in estimating the total number of victims of violent crime in the United States or the percentage of likely voters in the United Kingdom who intend to vote for the Labour Party in the next election, serious selection bias can cause the sample estimates to be invalid.

Purposive or judgment samples can provide valuable information, though, particularly in the early stages of an investigation. Teichman et al. (1993) took soil samples along Interstate 880 in Alameda County, California, to determine the amount of lead in yards of homes and in parks close to the freeway. In taking the samples, they concentrated on areas where they thought children were likely to play and areas where soil might easily be tracked into homes. The purposive sampling scheme worked well for justifying the conclusion of the study, that “lead contamination of urban soil in the east bay area of the San Francisco metropolitan area is high and exceeds hazardous waste levels at many sites.” A sampling scheme that avoided selection bias would only be needed for this study if the investigators wanted to generalize the estimated percentage of contaminated sites to the entire area.

Annex 2

Shere Hite’s book Women and Love: A Cultural Revolution in Progress (1987) had a number of widely quoted results:

• 84% of women are “not satisfied emotionally with their relationships” (p. 804).
• 70% of all women “married five or more years are having sex outside of their marriages” (p. 856).
• 95% of women “report forms of emotional and psychological harassment from men with whom they are in love relationships” (p. 810).
• 84% of women report forms of condescension from the men in their love relationships (p. 809).

The book was widely criticized in newspaper and magazine articles throughout the United States. The Time magazine cover story “Back Off, Buddy” (October 12, 1987), for example, called the conclusions of Hite’s study “dubious” and “of limited value.”

Why was Hite’s study so roundly criticized? Was it wrong for Hite to report the quotes from women who feel that the men in their lives refuse to treat them as equals, who perhaps have never been given the chance to speak out before? Was it wrong to report the percentages of these women who are unhappy in their relationships with men? Of course not. Hite’s research allowed women to discuss how they viewed their experiences, and reflected the richness of these women’s experiences in a way that a multiple-choice questionnaire could not. Hite’s error was in generalizing these results to all women, whether they participated in the survey or not, and in claiming that the percentages applied to all women. The following characteristics of the survey make it unsuitable for generalizing the results to all women.

• The sample was self-selected; that is, recipients of questionnaires decided whether they would be in the sample or not. Hite mailed 100,000 questionnaires; of these, 4.5% were returned.
• The questionnaires were mailed to such organizations as professional women’s groups, counseling centers, church societies, and senior citizens’ centers. The members may differ in political views, but many have joined an “all-women” group, and their viewpoints may differ from other women in the United States.
• The survey has 127 essay questions, and most of the questions have several parts. Who will tend to return the survey?
• Many of the questions are vague, using words such as love. The concept of love probably has as many interpretations as there are people, making it impossible to attach a single interpretation to any statistic purporting to state how many women are “in love.” Such question wording works well for eliciting the rich individual vignettes that comprise most of the book but makes interpreting percentages difficult.
• Many of the questions are leading: they suggest to the respondent which response she should make. For instance: “Does your husband/lover see you as an equal? Or are there times when he seems to treat you as an inferior? Leave you out of the decisions? Act superior?” (p. 795).

Hite writes, “Does research that is not based on a probability or random sample give one the right to generalize from the results of the study to the population at large? If a study is large enough and the sample broad enough, and if one generalizes carefully, yes” (p. 778). Most survey statisticians would answer Hite’s question with a resounding no. In Hite’s survey, because the women sent questionnaires were purposefully chosen and an extremely small percentage of the women returned the questionnaires, statistics calculated from these data cannot be used to indicate attitudes of all women in the United States. The final sample is not representative of women in the United States, and the statistics can only be used to describe women who would have responded to the survey.

Hite claims that results from the sample could be generalized because characteristics such as the age, educational, and occupational profiles of women in the sample matched those for the population of women in the United States. But the women in the sample differed on one important aspect: they were willing to take the time to fill out a long questionnaire dealing with harassment by men and to provide intensely personal information to a researcher.
We would expect that in every age group and socioeconomic class, women who choose to report such information would in general have had different experiences than women who choose not to participate in the survey.

Annex 3

2.7 Randomization Theory Results for Simple Random Sampling

As we have seen before, $\bar{y}$ is an unbiased estimator of $\bar{y}_U$, where the latter is the average of all possible values of $\bar{y}_S$ if we could examine all possible SRSs $S$ that could be chosen. We also calculated the variance of $\bar{y}$, given by

$$V(\bar{y}) = \frac{S^2}{n}\left(1 - \frac{n}{N}\right), \qquad S^2 = \frac{1}{N-1}\sum_{i=1}^{N}(y_i - \bar{y}_U)^2,$$

which can be estimated by the unbiased estimator

$$\hat{V}(\bar{y}) = \frac{s^2}{n}\left(1 - \frac{n}{N}\right), \qquad s^2 = \frac{1}{n-1}\sum_{i \in S}(y_i - \bar{y})^2.$$

No distributional assumptions are made about the $y_i$’s in order to ascertain that $\bar{y}$ is unbiased for estimating $\bar{y}_U$. We do not, for instance, assume that the $y_i$’s are normally distributed with mean $\mu$. In the randomization theory (also called design-based) approach to sampling, the $y_i$’s are considered to be fixed but unknown numbers; any probabilities used arise from the probabilities of selecting units to be in the sample. The randomization theory approach provides a nonparametric approach to inference: we need not make any assumptions about the distribution of random variables.

Let’s see how the randomization theory works for deriving properties of the sample mean in simple random sampling. As done in Cornfield (1944), define

$$Z_i = \begin{cases} 1 & \text{if unit } i \text{ is in the sample} \\ 0 & \text{otherwise.} \end{cases}$$

Then

$$\bar{y} = \sum_{i \in S} \frac{y_i}{n} = \sum_{i=1}^{N} Z_i \frac{y_i}{n}.$$

The $Z_i$’s are the only random variables in the above equation because, according to randomization theory, the $y_i$’s are fixed quantities. When we choose an SRS of $n$ units out of the $N$ units in the population, $\{Z_1, \ldots, Z_N\}$ are identically distributed Bernoulli random variables with

$$\pi_i = P(Z_i = 1) = P(\text{unit } i \text{ is selected in the sample}) = \frac{n}{N}. \qquad (2.18)$$

The probability in (2.18) follows from the definition of an SRS. To see this, note that if unit $i$ is in the sample, then the other $(n-1)$ units in the sample must be chosen from the other $(N-1)$ units in the population. A total of $\binom{N-1}{n-1}$ possible samples of size $(n-1)$ may be drawn from a population of size $(N-1)$, so

$$P(Z_i = 1) = \frac{\text{number of samples including unit } i}{\text{number of possible samples}} = \frac{\binom{N-1}{n-1}}{\binom{N}{n}} = \frac{n}{N}.$$

As a consequence of Equation (2.18),

$$E[Z_i] = E[Z_i^2] = \frac{n}{N}$$

and

$$E[\bar{y}] = E\left[\sum_{i=1}^{N} Z_i \frac{y_i}{n}\right] = \sum_{i=1}^{N} \frac{n}{N}\,\frac{y_i}{n} = \sum_{i=1}^{N} \frac{y_i}{N} = \bar{y}_U.$$

The variance of $\bar{y}$ is also calculated using properties of the random variables $Z_1, \ldots, Z_N$. Note that

$$V(Z_i) = E[Z_i^2] - (E[Z_i])^2 = \frac{n}{N} - \left(\frac{n}{N}\right)^2 = \frac{n}{N}\left(1 - \frac{n}{N}\right).$$

For $i \ne j$,

$$E[Z_i Z_j] = P(Z_i = 1 \text{ and } Z_j = 1) = P(Z_j = 1 \mid Z_i = 1)\,P(Z_i = 1) = \left(\frac{n-1}{N-1}\right)\left(\frac{n}{N}\right).$$

Because the population is finite, the $Z_i$’s are not quite independent: if we know that unit $i$ is in the sample, we do have a small amount of information about whether unit $j$ is in the sample, reflected in the conditional probability $P(Z_j = 1 \mid Z_i = 1)$. Consequently, for $i \ne j$,

$$\mathrm{Cov}(Z_i, Z_j) = E[Z_i Z_j] - E[Z_i]E[Z_j] = \frac{n-1}{N-1}\,\frac{n}{N} - \left(\frac{n}{N}\right)^2 = -\frac{1}{N-1}\left(1 - \frac{n}{N}\right)\frac{n}{N}.$$

We use the covariance (Cov) of $Z_i$ and $Z_j$ to calculate the variance of $\bar{y}$; see Appendix B for properties of covariances.

$$V(\bar{y}) = \frac{1}{n^2} V\left(\sum_{i=1}^{N} Z_i y_i\right) = \frac{1}{n^2}\left[\sum_{i=1}^{N} y_i^2 V(Z_i) + \sum_{i=1}^{N}\sum_{j \ne i} y_i y_j \mathrm{Cov}(Z_i, Z_j)\right]$$

$$= \frac{1}{n^2}\,\frac{n}{N}\left(1 - \frac{n}{N}\right)\left[\sum_{i=1}^{N} y_i^2 - \frac{1}{N-1}\sum_{i=1}^{N}\sum_{j \ne i} y_i y_j\right] = \frac{1}{n^2}\,\frac{n}{N}\left(1 - \frac{n}{N}\right)\frac{N\sum_{i=1}^{N} y_i^2 - \left(\sum_{i=1}^{N} y_i\right)^2}{N-1} = \left(1 - \frac{n}{N}\right)\frac{S^2}{n}.$$

The negative covariance of $Z_i$ and $Z_j$ is the source of the fpc.

To show that the estimator of $V(\bar{y})$ is unbiased, we need to show that $E[s^2] = S^2$. The argument proceeds much like the previous one. Since $S^2 = \sum_{i=1}^{N}(y_i - \bar{y}_U)^2/(N-1)$, it makes sense when trying to find an unbiased estimator to find the expected value of $\sum_{i \in S}(y_i - \bar{y})^2$ and then find the multiplicative constant that will give the unbiasedness:

$$E\left[\sum_{i \in S}(y_i - \bar{y})^2\right] = E\left[\sum_{i \in S}(y_i - \bar{y}_U)^2 - n(\bar{y} - \bar{y}_U)^2\right] = E\left[\sum_{i=1}^{N} Z_i (y_i - \bar{y}_U)^2\right] - n V(\bar{y})$$

$$= \frac{n}{N}\sum_{i=1}^{N}(y_i - \bar{y}_U)^2 - \left(1 - \frac{n}{N}\right)S^2 = \frac{n(N-1)}{N}S^2 - \frac{N-n}{N}S^2 = (n-1)S^2.$$

Thus,

$$E\left[\frac{1}{n-1}\sum_{i \in S}(y_i - \bar{y})^2\right] = E[s^2] = S^2.$$

2.8 A Model for Simple Random Sampling

Unless you have studied randomization theory in the design of experiments, the proofs in the preceding section probably seemed strange to you. The random variables in randomization theory are not concerned with the responses $y_i$: they are simply random variables that tell us whether the $i$th unit is in the sample or not. In a design-based, or randomization theory, approach to sampling inference, the only relationship between units sampled and units not sampled is that the nonsampled units could have been sampled had we used a different starting value for the random number generator.

In Section 2.7 we found properties of the sample mean $\bar{y}$ using randomization theory: $y_1, y_2, \ldots, y_N$ were considered to be fixed values, and $\bar{y}$ is unbiased because the average of $\bar{y}_S$ for all possible samples $S$ equals $\bar{y}_U$. The only probabilities used in finding the expected value and variance of $\bar{y}$ are the probabilities that units are included in the sample.

In your basic statistics class, you learned a different approach to inference. There, you had random variables $\{Y_i\}$ that followed some probability distribution, and the actual sample values were realizations of those random variables. Thus you assumed, for example, that $Y_1, Y_2, \ldots, Y_n$ were independent and identically distributed from a normal distribution with mean $\mu$ and variance $\sigma^2$ and used properties of independent random variables and the normal distribution to find expected values of various statistics.

We can extend this approach to sampling by thinking of random variables $Y_1, Y_2, \ldots, Y_N$ generated from some model. The actual values for the finite population, $y_1, y_2, \ldots, y_N$, are one realization of the random variables. The joint probability distribution of $Y_1, Y_2, \ldots, Y_N$ supplies the link between units in the sample and units not in the sample in this model-based approach, a link that is missing in the randomization approach. Here, we sample $\{y_i, i \in S\}$ and use these data to predict the unobserved values $\{y_i, i \notin S\}$. Thus, problems in finite population sampling may be thought of as prediction problems.

CHAPTER 13 VARIANCE ESTIMATION IN COMPLEX SURVEYS
_________________________________________________________________________

Population means and totals are easily estimated using weights. Estimating variances is more intricate. We noted before that in a complex survey with several levels of stratification and clustering, variances for estimated means and totals are calculated at each level and then combined as the survey design is ascended. Poststratification and nonresponse adjustment also affect the variance. In previous chapters, we have presented and derived variance formulas for a variety of sampling plans.
Some of the variance formulas, such as those for simple random samples (SRSs), are relatively simple. Other formulas, such as those for a two-stage cluster sample without replacement, are more complicated. All work for estimating variances of estimated totals. But we often want to estimate other quantities from survey data for which we have presented no variance formula. For example, in Chapter 3 we derived an approximate variance for a ratio of two means when an SRS is taken. What if you want to estimate a ratio, but the survey is not an SRS? How would you estimate the variance?

This chapter describes several methods for estimating variances of estimated totals and other statistics from complex surveys. Section 13.1 describes the commonly used linearization method for calculating variances of nonlinear statistics. Sections 13.2 and 13.3 present random group and resampling methods for calculating variances of linear and nonlinear statistics. Section 13.4 describes the calculation of generalized variance functions, and Section 13.5 describes constructing confidence intervals. These methods are described in more detail by Wolter (1985) and Rao (1988); Rao (1997) and Rust and Rao (1996) summarize recent work.

13.1 Linearization (Taylor Series) Methods

Most of the variance formulas in Chapters 2 through 6 were for estimates of means and totals. Those formulas can be used to find variances for any linear combination of estimated means and totals.
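If $\hat{T}_1, \ldots, \hat{T}_k$ are unbiased estimates of $k$ totals in the population, then

$$V\left(\sum_{i=1}^{k} a_i \hat{T}_i\right) = \sum_{i=1}^{k} a_i^2 V(\hat{T}_i) + 2\sum_{i=1}^{k-1}\sum_{j=i+1}^{k} a_i a_j \mathrm{Cov}(\hat{T}_i, \hat{T}_j). \qquad (13.1)$$

The result can be expressed equivalently using unbiased estimates of $k$ means in the population; the same formula holds with each $\hat{T}_i$ replaced by the corresponding estimated mean.

Thus, if $T_1$ is the total number of dollars robbery victims reported stolen, $T_2$ is the number of days of work robbery victims missed because of the crime, and $T_3$ is the total medical expenses incurred by robbery victims, one measure of financial consequences of robbery (assuming 150 dollars per day of work lost) might be $\hat{T}_1 + 150\hat{T}_2 + \hat{T}_3$. By (13.1), the variance is

$$V(\hat{T}_1 + 150\hat{T}_2 + \hat{T}_3) = V(\hat{T}_1) + 150^2 V(\hat{T}_2) + V(\hat{T}_3) + 300\,\mathrm{Cov}(\hat{T}_1, \hat{T}_2) + 2\,\mathrm{Cov}(\hat{T}_1, \hat{T}_3) + 300\,\mathrm{Cov}(\hat{T}_2, \hat{T}_3).$$

This expression requires calculation of six variances and covariances; it is easier computationally to define a new variable at the observation unit level,

$$q_i = y_{i1} + 150\,y_{i2} + y_{i3},$$

and find $V(\hat{T}_q)$ directly.

Suppose, though, that we are interested in the proportion of total loss accounted for by the stolen property, $T_1/T_q$. This is not a linear statistic, as $T_1/T_q$ cannot be expressed in the form $a_1 T_1 + a_2 T_q$ for constants $a_i$. But Taylor’s theorem from calculus allows us to linearize a smooth nonlinear function $h(T_1, T_2, \ldots, T_k)$ of the population totals; Taylor’s theorem gives the constants $a_0, a_1, \ldots, a_k$ so that

$$h(\hat{T}_1, \ldots, \hat{T}_k) \approx a_0 + \sum_{i=1}^{k} a_i \hat{T}_i.$$

Then $V[h(\hat{T}_1, \ldots, \hat{T}_k)]$ may be approximated by $V\left(\sum_{i=1}^{k} a_i \hat{T}_i\right)$, which we know how to calculate using (13.1).

Taylor series approximations have long been used in statistics to calculate approximate variances. Woodruff (1971) illustrates their use in complex surveys. Binder (1983) gives a more rigorous treatment of Taylor series methods for complex surveys and tells how to use linearization when the parameter of interest $\theta$ solves $h(\theta, T_1, \ldots, T_k) = 0$, but $\theta$ is not necessarily expressed as an explicit function of $T_1, \ldots, T_k$.

Example 13.1 The quantity $\theta = P(1-P)$, where $P$ is a population proportion, may be estimated by $\hat\theta = p(1-p)$. Assume that $p$ is an unbiased estimator of $P$ and that $V(p)$ is known.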
Let $h(x) = x(1-x)$, so $\theta = h(P)$ and $\hat\theta = h(p)$. Now $h$ is a nonlinear function of $x$, but the function can be approximated at any nearby point $a$ by the tangent line to the function; the slope of the tangent line is given by the derivative, as illustrated in Figure 13.1.

Figure 13.1 The function $h(x) = x(1-x)$, along with the tangent to the function at point $P$. If $p$ is close to $P$, then $h(p)$ will be close to the tangent line. The slope of the tangent line is $h'(P) = 1 - 2P$.

The first-order version of Taylor’s theorem states that if the second derivative of $h$ is continuous, then

$$h(x) = h(a) + h'(a)(x - a) + \int_a^x (x - t)\,h''(t)\,dt;$$

under conditions commonly satisfied in statistics, the last term is small relative to the first two, and we use the approximation

$$h(p) \approx h(P) + h'(P)(p - P) = P(1-P) + (1 - 2P)(p - P).$$

Then,

$$V[h(p)] \approx (1 - 2P)^2\,V(p - P) = (1 - 2P)^2\,V(p),$$

and $V(p)$ is known, so the approximate variance of $h(p)$ can be calculated.

The following are the basic steps for constructing a linearization estimator of the variance of a nonlinear function of means or totals:

1. Express the quantity of interest as a function of means or totals of variables measured or computed in the sample. In general, $\theta = h(T_1, T_2, \ldots, T_k)$ or $\theta = h(\bar{y}_{1U}, \ldots, \bar{y}_{kU})$. In Example 13.1, $\theta = h(P) = P(1-P)$.

2. Find the partial derivatives of $h$ with respect to each argument. The partial derivatives, evaluated at the population quantities, form the linearizing constants $a_j$.

3. Apply Taylor’s theorem to linearize the estimate:

$$h(\hat{T}_1, \ldots, \hat{T}_k) \approx h(T_1, \ldots, T_k) + \sum_{j=1}^{k} a_j(\hat{T}_j - T_j),$$

where

$$a_j = \left.\frac{\partial h(c_1, \ldots, c_k)}{\partial c_j}\right|_{T_1, \ldots, T_k}.$$

4. Define the new variable $q$ by

$$q_i = \sum_{j=1}^{k} a_j y_{ij}.$$

Now find the estimated variance of $\hat{T}_q = \sum_{i \in S} w_i q_i$. This will generally approximate the variance of $h(\hat{T}_1, \hat{T}_2, \ldots, \hat{T}_k)$.

Example 13.2 We used linearization methods to approximate the variance of the ratio and regression estimators in Chapter 3. In Chapter 3, we used an SRS, the estimator $\hat{B} = \bar{y}/\bar{x} = \hat{T}_y/\hat{T}_x$, and the approximation

$$\hat{B} - B \approx \frac{\hat{T}_y - B\hat{T}_x}{T_x}.$$

The resulting approximation to the variance was

$$V(\hat{B} - B) \approx \frac{1}{T_x^2}\,V(\hat{T}_y - B\hat{T}_x).$$

Essentially, we used Taylor’s theorem to obtain this approximation. The steps below give the same result.

1. Express $B$ as a function of the population totals. Let $h(c, d) = d/c$, so $B = h(T_x, T_y) = T_y/T_x$ and $\hat{B} = h(\hat{T}_x, \hat{T}_y)$. Assume that the sample estimates $\hat{T}_x$ and $\hat{T}_y$ are unbiased.

2. The partial derivatives are

$$\frac{\partial h(c, d)}{\partial c} = -\frac{d}{c^2} \quad \text{and} \quad \frac{\partial h(c, d)}{\partial d} = \frac{1}{c}.$$

Evaluated at $c = T_x$ and $d = T_y$, these are $-T_y/T_x^2$ and $1/T_x$.

3. By Taylor’s theorem,

$$\hat{B} = h(\hat{T}_x, \hat{T}_y) \approx h(T_x, T_y) + \left.\frac{\partial h}{\partial c}\right|_{T_x, T_y}(\hat{T}_x - T_x) + \left.\frac{\partial h}{\partial d}\right|_{T_x, T_y}(\hat{T}_y - T_y).$$

Using the partial derivatives from step 2,

$$\hat{B} - B \approx -\frac{T_y}{T_x^2}(\hat{T}_x - T_x) + \frac{1}{T_x}(\hat{T}_y - T_y).$$

4. The approximate mean squared error of $\hat{B}$ is

$$E[(\hat{B} - B)^2] \approx E\left[\left\{-\frac{T_y}{T_x^2}(\hat{T}_x - T_x) + \frac{1}{T_x}(\hat{T}_y - T_y)\right\}^2\right] = \frac{1}{T_x^2}\left[B^2 V(\hat{T}_x) + V(\hat{T}_y) - 2B\,\mathrm{Cov}(\hat{T}_x, \hat{T}_y)\right]. \qquad (13.2)$$

We can substitute values for $B$, for the variances and covariance, and possibly for $T_x$ from the particular sampling scheme used into (13.2). Alternatively, we would define

$$q_i = \frac{1}{T_x}\,[y_i - B x_i]$$

and find $V(\hat{T}_q)$. If the sampling design is an SRS of size $n$, then $\hat{T}_q = (N/n)\sum_{i \in S} q_i$ and $V(\hat{T}_q) = N^2\left(1 - \frac{n}{N}\right)\frac{S_q^2}{n}$, where $S_q^2$ is the population variance of the $q_i$.

Advantages If the partial derivatives are known, linearization almost always gives a variance estimate for a statistic and can be applied in general sampling designs. Linearization methods have been used for a long time in statistics, and the theory is well developed. Software exists for calculating linearization variance estimates for many nonlinear functions of interest, such as ratios and regression coefficients; some software will be discussed in Section 13.6.

Disadvantages Calculations can be messy, and the method is difficult to apply for complex functions involving weights. You must either find analytical expressions for the partial derivatives of $h$ or calculate the partial derivatives numerically. A separate variance formula is needed for each nonlinear statistic that is estimated, and that can require much special programming; a different method is needed for each statistic. In addition, not all statistics can be expressed as a smooth function of the population totals; the median and other quantiles, for example, do not fit into this framework. The accuracy of the linearization approximation depends on the sample size: the estimate of the variance is often biased downward if the sample is not large enough.

13.2 Random Group Methods

13.2.1. Replicating the Survey Design

Suppose the basic survey design is replicated independently $R$ times. Independently here means that after each sample is drawn, the sampled units are replaced in the population so that they are available for later samples.
Then, the $R$ replicate samples produce $R$ independent estimates of the quantity of interest; the variability among those estimates can be used to estimate the variance of $\hat\theta$. Mahalanobis (1946) describes early uses of the method, which he calls “replicated networks of sample units” and “interpenetrating sampling.” Let

$\theta$ = parameter of interest,
$\hat\theta_r$ = estimate of $\theta$ calculated from the $r$th replicate,
$\tilde\theta = \frac{1}{R}\sum_{r=1}^{R}\hat\theta_r$.

If $\hat\theta_r$ is an unbiased estimate of $\theta$, so is $\tilde\theta$, and

$$\hat{V}_1(\tilde\theta) = \frac{1}{R}\,\frac{1}{R-1}\sum_{r=1}^{R}(\hat\theta_r - \tilde\theta)^2 \qquad (13.3)$$

is an unbiased estimate of $V(\tilde\theta)$. Note that $\hat{V}_1(\tilde\theta)$ is the sample variance of the $R$ independent estimates of $\theta$ divided by $R$, the usual estimate of the variance of a sample mean.

Example 13.3 The 1991 Information Please Almanac listed enrollment, tuition, and room-and-board costs for every 4-year college in the United States. Suppose we want to estimate the ratio of nonresident tuition to resident tuition for public colleges and universities in the United States. In a typical implementation of the random group method, independent samples would be chosen using the same design and $\hat\theta$ found for each sample. Let’s take four SRSs of size 10 each (Table 13.1). Each of the four SRSs is taken without replacement, but the same college can appear in more than one of the four SRSs. For this example, $\theta$ is the ratio of average nonresident tuition to average resident tuition, and $\hat\theta_r = \bar{y}_r/\bar{x}_r$, where $\bar{x}_r$ and $\bar{y}_r$ are the average resident and nonresident tuition in sample $r$.

Table 13.1: Four SRSs of Colleges, Used in Example 13.3

Sample 1: Columbus College; Southeastern Massachusetts University; U.S. Naval Academy; Athens State College; University of South Alabama; Virginia State University; SUNY College of Technology-Farmingdale; University of Houston; CUNY-Lehman College; Austin Peay State University

Sample 2: SUNY-New Paltz; Indiana University-Southeast; University of Wisconsin-Platteville; University of California-Santa Barbara; Weber State College; Kennesaw College; South Dakota State University; Dickinson State University; Chadron State College; University of Alaska-Fairbanks

Sample 3: University of Alaska-Anchorage; University of Maine-Fort Kent; Southern University-Baton Rouge; University of Oregon; Virginia State University; Glenville State College; Winston-Salem State University; Framingham State College; SUNY-Old Westbury; Northwest Missouri State University

Sample 4: Central Washington University; Worcester State College; University of California-Davis; Sam Houston State University; University of Texas-Tyler; Southeastern Oklahoma State University; University of Southern Colorado; Pennsylvania State University; East Central University; University of Arkansas-Monticello

Sample averages   Enrollment   Resident tuition   Nonresident tuition
Sample 1          6934.2       1559               3630.6
Sample 2          6968.6       1505.2             3883.7
Sample 3          4790.2       1527.5             3756.3
Sample 4          8613         1527.1             4750.8

Thus, $\hat\theta_1 = 3630.6/1559 = 2.3288$, $\hat\theta_2 = 3883.7/1505.2 = 2.5802$, $\hat\theta_3 = 3756.3/1527.5 = 2.4591$, and $\hat\theta_4 = 4750.8/1527.1 = 3.1110$. The sample average of the four independent estimates of $\theta$ is $\tilde\theta = 2.6198$. The sample standard deviation (SD) of the four estimates is 0.343, so the standard error (SE) of $\tilde\theta$ is $0.343/\sqrt{4} = 0.172$. The estimated variance is based on four independent observations, so a 95% confidence interval (CI) for the ratio is

$$2.6198 \pm 3.18\,(0.172),$$

where 3.18 is the appropriate t critical value with 3 degrees of freedom (df). Note that the small number of replicates causes the confidence interval to be wider than it would be if more replicate samples were taken, because the estimate of the variance with 3 df is not very stable.

13.2.2. Dividing the Sample into Random Groups

In practice, subsamples are not usually drawn independently, but the complete sample is selected according to the survey design. The complete sample is then divided into $R$ groups so that each group forms a miniature version of the survey, mirroring the sample design.
The groups are then treated as though they are independent replicates of the basic survey design.

If the sample is an SRS of size $n$, the groups are formed by randomly apportioning the $n$ observations into $R$ groups, each of size $n/R$. These pseudo-random groups are not quite independent replicates because an observation unit can only appear in one of the groups; if the population size is large relative to the sample size, however, the groups can be treated as though they are independent replicates. In a cluster sample, the PSUs are randomly divided among the $R$ groups. The PSU takes all its observation units with it to the random group, so each random group is still a cluster sample. In a stratified multistage sample, a random group contains a sample of PSUs from each stratum. Note that if $k$ PSUs are sampled in the smallest stratum, at most $k$ random groups can be formed.

If $\theta$ is a nonlinear quantity, $\tilde\theta$ will not, in general, be the same as $\hat\theta$, the estimator calculated directly from the complete sample. For example, in ratio estimation,

$$\tilde\theta = \frac{1}{R}\sum_{r=1}^{R}\frac{\bar{y}_r}{\bar{x}_r}, \quad \text{while} \quad \hat\theta = \frac{\bar{y}}{\bar{x}}.$$

Usually, $\hat\theta$ is a more natural estimator than $\tilde\theta$. Sometimes $\hat{V}_1(\tilde\theta)$ from (13.3) is used to estimate $V(\hat\theta)$, although it is an overestimate. Another estimator of the variance is slightly larger but is often used:

$$\hat{V}_2(\hat\theta) = \frac{1}{R}\,\frac{1}{R-1}\sum_{r=1}^{R}(\hat\theta_r - \hat\theta)^2. \qquad (13.4)$$

Example 13.4 The 1987 Survey of Youth in Custody, discussed in Example 7.4, was divided into seven random groups. The survey design had 16 strata. Strata 6-16 each consisted of one facility (= PSU), and these facilities were sampled with probability 1. In strata 1-5, facilities were selected with probability proportional to number of residents in the 1985 Children in Custody census.

It was desired that each random group be a miniature of the sampling design. For each self-representing facility in strata 6-16, random group numbers were assigned as follows: the first resident selected from the facility was assigned a number between 1 and 7. Let’s say the first resident was assigned number 6.
Then the second resident in that facility would be assigned number 7, the third resident 1, the fourth resident 2, and so on. In strata 1-5, all residents in a facility (PSU) were assigned to the same random group. Thus, for the seven facilities sampled in stratum 2, all residents in facility 33 were assigned random group number 1, all residents in facility 9 were assigned random group number 2, and so on. Seven random groups were formed because strata 2-5 each have seven PSUs. After all random group assignments were made, each random group had the same basic design as the original sample. Random group 1, for example, forms a stratified sample in which a (roughly) random sample of residents is taken from the self-representing facilities in strata 6-16, and a pps (probability proportional to size) sample of facilities is taken from each of strata 1-5.

To use the random group method to estimate a variance, $\hat\theta_r$ is calculated for each random group. The following table shows estimates of mean age of residents for each random group; each estimate was calculated using $\hat\theta_r = \sum_i w_i y_i / \sum_i w_i$, where $w_i$ is the final weight for resident i and the summations are over observations in random group r.

Random Group Number    Estimate of Mean Age, $\hat\theta_r$
1                      16.55
2                      16.66
3                      16.83
4                      16.06
5                      16.32
6                      17.03
7                      17.27

The seven estimates are treated as independent observations, so $\tilde\theta = \frac{1}{7}\sum_{r=1}^{7}\hat\theta_r = 16.67$ and $\hat V_1(\tilde\theta) = \frac{1}{7}\cdot\frac{1}{6}\sum_{r=1}^{7}(\hat\theta_r - \tilde\theta)^2 = 0.0244$.

Using the entire data set, we calculate $\hat\theta$ directly from the complete sample. We can use either $\tilde\theta$ with $\hat V_1(\tilde\theta)$ or $\hat\theta$ with $\hat V_2(\hat\theta)$ to calculate confidence intervals; using $\tilde\theta$, for example, a 95% CI for mean age is $16.67 \pm 2.45\sqrt{0.0244} = [16.29, 17.06]$ (2.45 is the t critical value with 6 df).

Advantages
No special software is necessary to estimate the variance, and it is very easy to calculate the variance estimate. The method is well suited to multiparameter or nonparametric problems. It can be used to estimate variances for percentiles and nonsmooth functions, as well as variances of smooth functions of the population totals. Random group methods are easily used after weighting adjustments for nonresponse and undercoverage.
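The random group calculation for Example 13.4 can be sketched in a few lines. This is only an illustration using the seven group estimates tabulated above; the function name `random_group_variance` is a hypothetical helper, and the variance form (sum of squared deviations divided by R(R - 1)) is the one assumed for (13.3) and (13.4).

```python
from statistics import mean


def random_group_variance(estimates, center=None):
    """Random group variance: sum((theta_r - center)^2) / (R * (R - 1)).

    With center=None the replicate mean (theta-tilde) is used; passing the
    full-sample estimate instead gives the slightly larger form in (13.4).
    """
    R = len(estimates)
    if center is None:
        center = mean(estimates)
    return sum((est - center) ** 2 for est in estimates) / (R * (R - 1))


# Mean age of residents in each of the seven random groups (Example 13.4).
group_estimates = [16.55, 16.66, 16.83, 16.06, 16.32, 17.03, 17.27]

theta_tilde = mean(group_estimates)           # average of the group estimates
v1 = random_group_variance(group_estimates)   # variance estimate for theta-tilde
se = v1 ** 0.5

T_CRIT_6DF = 2.45  # t critical value with 6 df, as quoted in the text
ci = (theta_tilde - T_CRIT_6DF * se, theta_tilde + T_CRIT_6DF * se)

print(f"theta_tilde = {theta_tilde:.3f}, V1 = {v1:.4f}, SE = {se:.3f}")
print(f"95% CI: [{ci[0]:.2f}, {ci[1]:.2f}]")
```

Passing the full-sample estimate as `center` would give the alternative estimator; that value is not used here because only the group estimates are available.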
Disadvantages
The number of random groups is often small, which gives imprecise estimates of the variances. Generally, you would like at least ten random groups to obtain a more stable estimate of the variance and to avoid inflating the confidence interval by using the t distribution rather than the normal distribution. Setting up the random groups can be difficult in complex designs, as each random group must have the same design structure as the complete survey. The survey design may limit the number of random groups that can be constructed; if two PSUs are selected in each stratum, then only two random groups can be formed.

13.3 Resampling and Replication Methods

Random group methods are easy to compute and explain but are unstable if a complex sample can only be split into a small number of groups. Resampling methods treat the sample as if it were itself a population; we take different samples from this new "population" and use the subsamples to estimate a variance. All methods in this section calculate variance estimates for a sample in which PSUs are sampled with replacement. If PSUs are sampled without replacement, these methods may still be used but are expected to overestimate the variance and result in conservative confidence intervals.

13.3.1. Balanced Repeated Replication (BRR)

Some surveys are stratified to the point that only two PSUs are selected from each stratum. This gives the highest degree of stratification possible while still allowing calculation of variance estimates in each stratum.

13.3.1.1. BRR in a Stratified Random Sample

We illustrate BRR for a problem we already know how to solve: calculating the variance for $\bar y_{str}$ from a stratified random sample. More complicated statistics from stratified multistage samples are discussed in Section 13.3.1.2. Suppose an SRS of two observation units is chosen from each of seven strata. We arbitrarily label one of the sampled units in stratum h as $y_{h1}$ and the other as $y_{h2}$.
The sampled values are given in Table 13.2.

Table 13.2: A Small Stratified Random Sample, Used to Illustrate BRR

Stratum    N_h/N    y_h1 - y_h2
1          .30         208
2          .10        -210
3          .05      -4,510
4          .10        -450
5          .20       2,036
6          .05         446
7          .20          36

(The columns of individual values $y_{h1}$ and $y_{h2}$ are not legible in this copy; only the differences survive.)

The stratified estimate of the population mean is $\bar y_{str} = \sum_{h=1}^{H} (N_h/N)\,\bar y_h = 4451.7$. Ignoring the fpc's (finite population corrections) in Equation (4.5) gives the variance estimate $\hat V_{str}(\bar y_{str}) = \sum_{h=1}^{H} (N_h/N)^2 s_h^2/n_h$; when $n_h = 2$, as here, $s_h^2 = (y_{h1} - y_{h2})^2/2$, so $\hat V_{str}(\bar y_{str}) = \sum_{h=1}^{H} (N_h/N)^2 (y_{h1} - y_{h2})^2/4$. Here, $\hat V_{str}(\bar y_{str}) = 55{,}892.75$. This may overestimate the variance if sampling is without replacement.

To use the random group method, we would randomly select one of the observations in each stratum for group 1 and assign the other to group 2. The groups in this situation are half-samples. For example, group 1 might consist of {y11, y22, y32, y42, y51, y62, y71} and group 2 of the other seven observations. Then, $\hat\theta_1 = 4824.7$ and $\hat\theta_2 = 4078.7$. The random group estimate of the variance, in this case $(\hat\theta_1 - \hat\theta_2)^2/4 = 139{,}129$, has only 1 df for a two-PSU-per-stratum design and is unstable in practice. If a different assignment of observations to groups had been made (had, for example, group 1 consisted of $y_{h1}$ for strata 2, 3, and 5 and $y_{h2}$ for strata 1, 4, 6, and 7), then $\hat\theta_1 = 4508.6$ and $\hat\theta_2 = 4394.8$, and the random group estimate of the variance would have been 3238.

McCarthy (1966; 1969) notes that altogether $2^H$ possible half-samples could be formed and suggests using a balanced sample of the $2^H$ possible half-samples to estimate the variance. Balanced repeated replication uses the variability among R replicate half-samples that are selected in a balanced way to estimate the variance of $\hat\theta$.

To define balance, let's introduce the following notation. Half-sample r can be defined by a vector $\alpha_r = (\alpha_{r1}, \ldots, \alpha_{rH})$: let $y_h(\alpha_r) = y_{h1}$ if $\alpha_{rh} = 1$ and $y_h(\alpha_r) = y_{h2}$ if $\alpha_{rh} = -1$. Equivalently, $y_h(\alpha_r) = \frac{\alpha_{rh}+1}{2}\,y_{h1} - \frac{\alpha_{rh}-1}{2}\,y_{h2}$. If group 1 contains observations {y11, y22, y32, y42, y51, y62, y71} as above, then $\alpha_1 = (1, -1, -1, -1, 1, -1, 1)$. Similarly, $\alpha_2 = (-1, 1, 1, 1, -1, 1, -1)$. The set of R replicate half-samples is balanced if $\sum_{r=1}^{R} \alpha_{rh}\alpha_{rl} = 0$ for all $l \ne h$. Let $\hat\theta(\alpha_r)$ be the estimate of interest, calculated the same way as $\hat\theta$ but using only the observations in the half-sample selected by $\alpha_r$.
For estimating the mean of a stratified sample, $\hat\theta(\alpha_r) = \bar y_{str}(\alpha_r) = \sum_{h=1}^{H} (N_h/N)\,y_h(\alpha_r)$. Define the BRR variance estimator to be $\hat V_{BRR}(\hat\theta) = \frac{1}{R}\sum_{r=1}^{R} [\hat\theta(\alpha_r) - \hat\theta]^2$. If the set of half-samples is balanced, then $\hat V_{BRR}(\bar y_{str}) = \hat V_{str}(\bar y_{str})$. (The proof of this is left as Exercise 6.) If, in addition, $\sum_{r=1}^{R} \alpha_{rh} = 0$ for h = 1, . . . , H, then $\frac{1}{R}\sum_{r=1}^{R} \bar y_{str}(\alpha_r) = \bar y_{str}$.

For our example, the set of $\alpha$'s in the following table meets the balancing condition $\sum_r \alpha_{rh}\alpha_{rl} = 0$ for all $l \ne h$. The 8 x 7 matrix of -1's and 1's has orthogonal columns; in fact, it is the design matrix (excluding the column of 1's) for a fractional factorial design (Box et al. 1978). Designs described by Plackett and Burman (1946) give matrices with k orthogonal columns, for k a multiple of 4; Wolter (1985) explicitly lists some of these matrices.

                      Stratum (h)
Half-sample (r)    1    2    3    4    5    6    7
alpha_1           -1   -1   -1    1    1    1   -1
alpha_2            1   -1   -1   -1   -1    1    1
alpha_3           -1    1   -1   -1    1   -1    1
alpha_4            1    1   -1    1   -1   -1   -1
alpha_5           -1   -1    1    1   -1   -1    1
alpha_6            1   -1    1   -1    1   -1   -1
alpha_7           -1    1    1   -1   -1    1   -1
alpha_8            1    1    1    1    1    1    1

The estimate from each half-sample, $\hat\theta(\alpha_r) = \bar y_{str}(\alpha_r)$, is calculated from the data in Table 13.2.

Half-sample (r)    theta-hat(alpha_r)    [theta-hat(alpha_r) - theta-hat]^2
1                  4732.4                 78,792.5
2                  4439.8                    141.6
3                  4741.3                 83,868.2
4                  4344.3                 11,534.8
5                  4084.6                134,762.4
6                  4592.0                 19,684.1
7                  4123.7                107,584.0
8                  4555.5                 10,774.4
Average            4451.7                 55,892.8

The average of $[\hat\theta(\alpha_r) - \hat\theta]^2$ for the eight replicate half-samples is 55,892.75, which is the same as $\hat V_{str}(\bar y_{str})$ for sampling with replacement.

Note that we can do the BRR estimation above by creating a new variable of weights for each replicate half-sample. The sampling weight for observation i in stratum h is $w_{hi} = N_h/n_h$, and $\bar y_{str} = \sum_h \sum_i w_{hi} y_{hi} / \sum_h \sum_i w_{hi}$. In BRR with a stratified random sample, we eliminate one of the two observations in stratum h to calculate $y_h(\alpha_r)$. To compensate, we double the weight for the remaining observation. Define $w_{hi}(\alpha_r) = 2w_{hi}$ if observation i of stratum h is in the half-sample selected by $\alpha_r$, and $w_{hi}(\alpha_r) = 0$ otherwise. Then, $\bar y_{str}(\alpha_r) = \sum_h \sum_i w_{hi}(\alpha_r)\,y_{hi} / \sum_h \sum_i w_{hi}(\alpha_r)$. Similarly, for any statistic $\hat\theta$ calculated using the weights $w_{hi}$, $\hat\theta(\alpha_r)$ is calculated exactly the same way, but using the new weights $w_{hi}(\alpha_r)$.
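The balanced half-sample calculation above can be checked numerically. The sketch below is illustrative only: it uses the stratum weights and differences from Table 13.2 and the full-sample mean 4451.7, and it relies on the identity y_h(alpha_r) = (y_h1 + y_h2)/2 + alpha_rh (y_h1 - y_h2)/2, so each half-sample mean equals the full-sample mean plus half the signed, weighted differences. The construction of the 8 x 7 matrix as a 2^3 factorial with its interaction columns is an assumption that happens to reproduce the tabulated signs; the variable names are hypothetical.

```python
# Stratum weights N_h/N and within-stratum differences y_h1 - y_h2 (Table 13.2).
W = [0.30, 0.10, 0.05, 0.10, 0.20, 0.05, 0.20]
d = [208, -210, -4510, -450, 2036, 446, 36]
YBAR_STR = 4451.7  # full-sample stratified mean (the "Average" row above)

# 2^3 factorial with all interaction columns: 8 half-samples x 7 strata.
alphas = [(a, b, c, a * b, a * c, b * c, a * b * c)
          for c in (-1, 1) for b in (-1, 1) for a in (-1, 1)]

# Balance: every pair of stratum columns is orthogonal across half-samples.
H = len(W)
balanced = all(sum(al[h] * al[l] for al in alphas) == 0
               for h in range(H) for l in range(h + 1, H))


def half_sample_mean(alpha):
    """Stratified mean using y_h1 where alpha_h = 1 and y_h2 where alpha_h = -1."""
    return YBAR_STR + 0.5 * sum(w * a * diff for w, a, diff in zip(W, alpha, d))


half_means = [half_sample_mean(al) for al in alphas]

# BRR variance: average squared deviation of half-sample means from the full mean.
v_brr = sum((m - YBAR_STR) ** 2 for m in half_means) / len(half_means)

# Stratified with-replacement variance for n_h = 2: sum of W_h^2 * d_h^2 / 4.
v_str = sum(w ** 2 * diff ** 2 / 4 for w, diff in zip(W, d))

print("balanced:", balanced)
print("half-sample means:", [round(m, 1) for m in half_means])
print(f"V_BRR = {v_brr:.2f}, V_str = {v_str:.2f}")
```

Because the design is balanced, the two variance estimates agree exactly (up to floating-point error), which is the equivalence the text asserts for the stratified mean.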
Using the new weight variables instead of selecting the subset of observations simplifies calculations for surveys with many response variables: the same column $w(\alpha_r)$ can be used to find the rth half-sample estimate for all quantities of interest. The modified weights also make it easy to extend the method to stratified multistage samples.

13.3.1.2. BRR in a Stratified Multistage Survey

When $\bar y_{str}$ is the only quantity of interest in a stratified random sample, BRR is simply a fancy method of calculating the variance of Equation (4.5) and adds little extra to the procedure in Chapter 4. BRR's value in a complex survey comes from its ability to estimate the variance of a general population quantity $\theta$, where $\theta$ may be a ratio of two variables, a correlation coefficient, a quantile, or another quantity of interest.

Suppose the population has H strata, and two PSUs are selected from stratum h with unequal probabilities and with replacement. (In replication methods, we like sampling with replacement because the subsampling design does not affect the variance estimator, as we saw in Section 6.3.) The same method may be used when sampling is done without replacement in each stratum, but the estimated variance of $\hat\theta$, calculated under the assumption of with-replacement sampling, is expected to be larger than the without-replacement variance.

The data file for a complex survey with two PSUs per stratum often resembles that shown in Table 13.3, after sorting by stratum and PSU. The vector $\alpha_r$ defines the half-sample r: if $\alpha_{rh} = 1$, then all observation units in PSU 1 of stratum h are in half-sample r; if $\alpha_{rh} = -1$, then all observation units in PSU 2 of stratum h are in half-sample r. The vectors $\alpha_r$ are selected in a balanced way, exactly as in stratified random sampling. Now, for half-sample r, create a new column of weights $w(\alpha_r)$: $w_i(\alpha_r) = 2w_i$ if observation unit i is in half-sample r, and $w_i(\alpha_r) = 0$ otherwise.

Table 13.3: Data Structure After Sorting

Observation    Weight    Response      Response      Response
Number         w_i       Variable 1    Variable 2    Variable 3
1              w_1       y_1           x_1           u_1
2              w_2       y_2           x_2           u_2
3              w_3       y_3           x_3           u_3
4              w_4       y_4           x_4           u_4
5              w_5       y_5           x_5           u_5
6              w_6       y_6           x_6           u_6
7              w_7       y_7           x_7           u_7
8              w_8       y_8           x_8           u_8
9              w_9       y_9           x_9           u_9
10             w_10      y_10          x_10          u_10
11             w_11      y_11          x_11          u_11
etc.

(The table also has Stratum Number, PSU Number, and SSU Number columns, whose entries are not legible in this copy.)

For the data structure in Table 13.3, and $\alpha_{r1} = -1$ and $\alpha_{r2} = 1$, the column $w(\alpha_r)$ will be (0, 0, 0, 0, 2w_5, 2w_6, 2w_7, 2w_8, 2w_9, 2w_10, 2w_11, ...). That is, observations 1 to 4 belong to PSU 1 of stratum 1 and are dropped from this half-sample.

Now use the column $w(\alpha_r)$ instead of w to estimate quantities for half-sample r. The estimate of the population total of y for the full sample is $\sum_i w_i y_i$; the estimate of the population total of y for half-sample r is $\sum_i w_i(\alpha_r)\,y_i$. If $\theta = t_y/t_x$, then $\hat\theta = \sum_i w_i y_i / \sum_i w_i x_i$ and $\hat\theta(\alpha_r) = \sum_i w_i(\alpha_r)\,y_i / \sum_i w_i(\alpha_r)\,x_i$.

We saw in Section 7.3 that the empirical distribution function is calculated using the weights: $\hat F(y) = \sum_{i:\,y_i \le y} w_i / \sum_i w_i$. Then, the empirical distribution using half-sample r is $\hat F_r(y) = \sum_{i:\,y_i \le y} w_i(\alpha_r) / \sum_i w_i(\alpha_r)$. If $\theta$ is the population median, then $\hat\theta$ may be defined as the smallest value of y for which $\hat F(y) \ge 1/2$, and $\hat\theta(\alpha_r)$ is the smallest value of y for which $\hat F_r(y) \ge 1/2$.

For any quantity $\theta$, we define

$\hat V_{BRR}(\hat\theta) = \frac{1}{R}\sum_{r=1}^{R} [\hat\theta(\alpha_r) - \hat\theta]^2$   (13.6)

BRR can also be used to estimate covariances of statistics: if $\theta$ and $\eta$ are two quantities of interest, then $\widehat{Cov}_{BRR}(\hat\theta, \hat\eta) = \frac{1}{R}\sum_{r=1}^{R} [\hat\theta(\alpha_r) - \hat\theta][\hat\eta(\alpha_r) - \hat\eta]$. Other BRR variance estimators, variations of (13.6), are described in Exercise 7.

While the exact equivalence of $\hat V_{BRR}(\bar y_{str})$ and $\hat V_{str}(\bar y_{str})$ does not extend to nonlinear statistics, Krewski and Rao (1981) and Rao and Wu (1985) show that if h is a smooth function of the population totals, the variance estimate from BRR is asymptotically equivalent to that from linearization. BRR also provides a consistent estimator of the variance for quantiles when a stratified random sample is taken (Shao and Wu 1992).

Example 13.5 Bye and Gallicchio (1993) describe BRR estimates of variance in the U.S. Survey of Income and Program Participation (SIPP). SIPP, like the National Crime Victimization Survey (NCVS), has a stratified multistage cluster design. Self-representing (SR) strata consist of one PSU that is sampled with probability 1, and one PSU is selected with PPS from each non-self-representing (NSR) stratum.
Strictly speaking, BRR does not apply since only one PSU is selected in each stratum, and BRR requires two PSUs per stratum. To use BRR, "pseudostrata" and "pseudo-PSUs" were formed. A typical pseudostratum was formed by combining an SR stratum with two similar NSR strata: the PSU selected in each NSR stratum was randomly assigned to one of the two pseudo-PSUs, and the segments in the SR PSU were randomly split between the two pseudo-PSUs. This procedure created 72 pseudostrata, each with two pseudo-PSUs. The 72 half-samples, each containing the observations from one pseudo-PSU from each pseudostratum, were formed using a 71-factor Plackett-Burman (1946) design. This design is orthogonal, so the set of replicate half-samples is balanced.

About 8500 of the 54,000 persons in the 1990 sample said they received Social Security benefits; Bye and Gallicchio wanted to estimate the mean and median monthly benefit amount for persons receiving benefits, for a variety of subpopulations. The mean monthly benefit for married males was estimated as $\bar y = \sum_{i \in S_M} w_i y_i / \sum_{i \in S_M} w_i$, where $y_i$ is the monthly benefit amount for person i in the sample, $w_i$ is the weight assigned to person i, and $S_M$ is the subset of the sample consisting of married males receiving Social Security benefits. The median benefit payment can be estimated from the empirical distribution function for the married men in the sample: $\hat F(y) = \sum_{i \in S_M,\ y_i \le y} w_i / \sum_{i \in S_M} w_i$. The estimate of the median, $\hat\theta$, satisfies $\hat F(\hat\theta) \ge 1/2$ but $\hat F(x) < 1/2$ for all $x < \hat\theta$. Calculating $\hat\theta(\alpha_r)$ for a replicate is simple: merely define a new weight variable $w(\alpha_r)$, as previously described, and use $w(\alpha_r)$ instead of w to estimate the mean and median.

Advantages
BRR gives a variance estimate that is asymptotically equivalent to that from linearization methods for smooth functions of population totals and for quantiles. It requires relatively few computations when compared with the jackknife and the bootstrap.

Disadvantages
As defined earlier, BRR requires a two-PSU-per-stratum design.
In practice, though, it is often extended to other sampling designs by using more complicated balancing schemes. BRR, like the jackknife and bootstrap, estimates the with-replacement variance and may overestimate the variance if the $N_h$'s, the numbers of PSUs in stratum h in the population, are small.

13.3.2. The Jackknife

The jackknife method, like BRR, extends the random group method by allowing the replicate groups to overlap. The jackknife was introduced by Quenouille (1949; 1956) as a method of reducing bias; Tukey (1958) used it to estimate variances and calculate confidence intervals. In this section, we describe the delete-1 jackknife; Shao and Tu (1995) discuss other forms of the jackknife and give theoretical results.

For an SRS, let $\hat\theta_{(j)}$ be the estimator of the same form as $\hat\theta$ but not using observation j. Thus, if $\hat\theta = \bar y$, then $\hat\theta_{(j)} = \bar y_{(j)} = \frac{1}{n-1}\sum_{i \ne j} y_i$. For an SRS, define the delete-1 jackknife estimator (so called because we delete one observation in each replicate) as

$\hat V_{JK}(\hat\theta) = \frac{n-1}{n}\sum_{j=1}^{n} (\hat\theta_{(j)} - \hat\theta)^2$   (13.7)

Why the multiplier (n - 1)/n? Let's look at $\hat V_{JK}(\hat\theta)$ when $\hat\theta = \bar y$. When $\hat\theta = \bar y$, $\bar y_{(j)} = \frac{1}{n-1}\sum_{i \ne j} y_i = \bar y - \frac{1}{n-1}(y_j - \bar y)$. Then, $\sum_{j=1}^{n} (\bar y_{(j)} - \bar y)^2 = \frac{1}{(n-1)^2}\sum_{j=1}^{n} (y_j - \bar y)^2 = \frac{s_y^2}{n-1}$. Thus, $\hat V_{JK}(\bar y) = s_y^2/n$, the with-replacement estimate of the variance of $\bar y$.

Example 13.6 Let's use the jackknife to estimate the ratio of nonresident tuition to resident tuition for the first group of colleges in Table 13.1. Here, n = 10 and $\hat\theta = \bar y/\bar x$. For each jackknife group, omit one observation. Thus, $\bar x_{(1)}$ is the average of all x's except for $x_1$, and $\hat\theta_{(j)} = \bar y_{(j)}/\bar x_{(j)}$ (Table 13.4).

Table 13.4: Jackknife Calculations for Example 13.6

j     xbar_(j)    ybar_(j)    theta-hat_(j)
1     1580.6      3617.7      2.2889
2     1545.9      3480.3      2.2513
3     1565.6      3867.3      2.4703
4     1612.2      3794.0      2.3533
5     1523.9      3759.0      2.4667
6     1391.0      3463.4      2.4899
7     1560.9      3595.1      2.3032
8     1628.9      3584.0      2.2003
9     1583.3      3574.0      2.2573
10    1597.8      3571.1      2.2350

(The columns of individual values $x_j$ and $y_j$ are not legible in this copy.) From the entries in Table 13.4, $\hat\theta \approx 2.329$ and (13.7) gives $\hat V_{JK}(\hat\theta) \approx 0.094$.

How can we extend this to a cluster sample? One might think that you could just delete one observation unit at a time, but that will not work: deleting one observation unit at a time destroys the cluster structure and gives an estimate of the variance that is only correct if the intraclass correlation is zero.
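For the SRS case in Example 13.6, the delete-1 jackknife of (13.7) can be sketched from the leave-one-out means in Table 13.4. This is an illustration only: because the individual x and y values are not available here, the full-sample means are recovered as the average of the ten leave-one-out means (which equals the full-sample mean exactly, up to the rounding in the table), and the variable names are hypothetical.

```python
# Leave-one-out means from Table 13.4 (resident tuition x, nonresident tuition y).
xbar_del = [1580.6, 1545.9, 1565.6, 1612.2, 1523.9,
            1391.0, 1560.9, 1628.9, 1583.3, 1597.8]
ybar_del = [3617.7, 3480.3, 3867.3, 3794.0, 3759.0,
            3463.4, 3595.1, 3584.0, 3574.0, 3571.1]

n = len(xbar_del)

# Ratio estimate with observation j deleted.
theta_del = [yb / xb for xb, yb in zip(xbar_del, ybar_del)]

# The mean of the n leave-one-out means equals the full-sample mean.
xbar = sum(xbar_del) / n
ybar = sum(ybar_del) / n
theta_hat = ybar / xbar

# Delete-1 jackknife variance, equation (13.7).
v_jk = (n - 1) / n * sum((t - theta_hat) ** 2 for t in theta_del)
se_jk = v_jk ** 0.5

print(f"theta_hat = {theta_hat:.4f}")
print(f"V_JK = {v_jk:.4f}, SE = {se_jk:.3f}")
```

For a cluster sample the same arithmetic applies, but each replicate must delete an entire PSU rather than a single observation unit, as the text goes on to explain.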
In any resampling method and in the random group method, keep observation units within a PSU together while constructing the replicates; this preserves the dependence among observation units within the same PSU. For a cluster sample, then, we would apply the jackknife variance estimator in (13.7) by letting n be the number of PSUs and letting $\hat\theta_{(j)}$ be the estimate of $\theta$ that we would obtain by deleting all the observations in PSU j.

In a stratified multistage cluster sample, the jackknife is applied separately in each stratum at the first stage of sampling, with one PSU deleted at a time. Suppose there are H strata, and $n_h$ PSUs are chosen for the sample from stratum h. Assume these PSUs are chosen with replacement. To apply the jackknife, delete one PSU at a time. Let $\hat\theta_{(hj)}$ be the estimator of the same form as $\hat\theta$ when PSU j of stratum h is omitted. To calculate $\hat\theta_{(hj)}$, define a new weight variable: let

$w_{i(hj)} = w_i$ if observation unit i is not in stratum h,
$w_{i(hj)} = 0$ if observation unit i is in PSU j of stratum h,
$w_{i(hj)} = \frac{n_h}{n_h - 1}\,w_i$ if observation unit i is in stratum h but not in PSU j.

Then use the weights $w_{i(hj)}$ to calculate $\hat\theta_{(hj)}$ and

$\hat V_{JK}(\hat\theta) = \sum_{h=1}^{H} \frac{n_h - 1}{n_h}\sum_{j=1}^{n_h} (\hat\theta_{(hj)} - \hat\theta)^2$   (13.8)

Example 13.7 Here we use the jackknife to calculate the variance of the mean egg volume from Example 5.6. In that example, since we did not know the number of clutches in the population, we calculated the with-replacement variance. First, find the weight vector for each of the 184 jackknife iterations. We have only one stratum, so h = 1 for all observations. For $\hat\theta_{(1,1)}$, delete the first PSU. Thus, the new weights for the observations in the first PSU are 0; the weights in all remaining PSUs are the previous weights times $n_h/(n_h - 1) = 184/183$. Using the weights from Example 5.8, the new jackknife weight columns are shown in Table 13.5.

Table 13.5: Jackknife Weights, for Example 13.7

clutch    csize    relweight    w(1,1)      w(1,2)      ...    w(1,184)
1         13       6.5          0           6.535519    ...    6.535519
1         13       6.5          0           6.535519    ...    6.535519
2         13       6.5          6.535519    0           ...    6.535519
2         13       6.5          6.535519    0           ...    6.535519
3         6        3            3.016393    3.016393    ...    3.016393
3         6        3            3.016393    3.016393    ...    3.016393
4         11       5.5          5.530055    5.530055    ...    5.530055
4         11       5.5          5.530055    5.530055    ...    5.530055
...       ...      ...          ...         ...         ...    ...
183       13       6.5          6.535519    6.535519    ...    6.535519
183       13       6.5          6.535519    6.535519    ...    6.535519
184       12       6            6.032787    6.032787    ...    0
184       12       6            6.032787    6.032787    ...    0
Sum       3514     1757         1753.53     1753.53     ...    1754.54

Note that the sums of the jackknife weights vary from column to column because the original sample is not self-weighting. We calculated $\hat\theta$ as $\sum_i w_i y_i / \sum_i w_i$; to find $\hat\theta_{(hj)}$, follow the same procedure but use $w_{i(hj)}$ in place of $w_i$. Using (13.8), then, we calculate $\hat V_{JK}(\hat\theta)$. This results in a standard error of 0.061, the same as calculated in Example 5.6.

Advantages
This is an all-purpose method. The same procedure is used to estimate the variance for every statistic for which the jackknife can be used. The jackknife works in stratified multistage samples in which BRR does not apply because more than two PSUs are sampled in each stratum. The jackknife provides a consistent estimator of the variance when $\theta$ is a smooth function of population totals (Krewski and Rao 1981).

Disadvantages
The jackknife performs poorly for estimating the variances of some statistics. For example, the jackknife produces a poor estimate of the variance of quantiles in an SRS. Little is known about how the jackknife performs in unequal-probability, without-replacement sampling designs in general.

13.3.3. The Bootstrap

As with the jackknife, theoretical results for the bootstrap were developed for areas of statistics other than survey sampling; Shao and Tu (1995) summarize theoretical results for the bootstrap in complex survey samples. We first describe the bootstrap for an SRS with replacement, as developed by Efron (1979, 1982) and described in Efron and Tibshirani (1993). Suppose S is an SRS of size n. We hope, in drawing the sample, that it reproduces properties of the whole population. We then treat the sample S as if it were a population and take resamples from S.
If the sample really is similar to the population (if the empirical probability mass function (epmf) of the sample is similar to the probability mass function of the population), then samples generated from the epmf should behave like samples taken from the population.

Example 13.8 Let's use the bootstrap to estimate the variance of the median height, $\theta$, in the height population from Example 7.3, using the sample in the file ht.srs. The population median height is $\theta = 168$; the sample median $\hat\theta$ is computed from ht.srs. Figure 7.2, the probability mass function for the population, and Figure 7.3, the histogram of the sample, are similar in shape (largely because the sample size for the SRS is large), so we would expect that taking an SRS of size n with replacement from S would be like taking an SRS with replacement from the population. A resample from S, though, will not be exactly the same as S because the resample is with replacement: some observations in S may occur twice or more in the resample, while other observations in S may not occur at all.

We take an SRS of size 200 with replacement from S to form the first resample. The first resample from S has an epmf similar to but not identical to that of S, and we record its median $\hat\theta^*_1$. Repeating the process, the second resample from S has median $\hat\theta^*_2$. We take a total of R = 2000 resamples from S and calculate the sample median from each resample, obtaining $\hat\theta^*_1, \hat\theta^*_2, \ldots, \hat\theta^*_{2000}$. We obtain the following frequency table for the 2000 sample medians:

Median of Resample    Frequency
165                      1
166                      5
166.5                    2
167                     40
167.5                   15
168                    268
168.5                   87
169                    739
169.5                  111
170                    491
170.5                   44
171                    188
171.5                    5
172                      4

The sample mean of these 2000 values is 169.3, and the sample variance of these 2000 values is 0.9148; this is the bootstrap estimator of the variance. The bootstrap distribution may be used to calculate a confidence interval directly: since it estimates the sampling distribution of $\hat\theta$, a 95% CI is calculated by finding the 2.5 percentile and the 97.5 percentile of the bootstrap distribution.
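The bootstrap summary in Example 13.8 can be reproduced from the frequency table of resample medians. The sketch below is illustrative: `bootstrap_replicates` shows how the resamples would be drawn from a sample such as the one in ht.srs (the data file itself is not used here), `percentile` is a simple nearest-rank rule chosen for this sketch rather than a method the text specifies, and all names are hypothetical.

```python
import math
import random
from statistics import mean, median, variance


def bootstrap_replicates(sample, stat, R, seed=0):
    """Draw R with-replacement resamples of size n and return stat of each."""
    rng = random.Random(seed)
    n = len(sample)
    return [stat(rng.choices(sample, k=n)) for _ in range(R)]


def percentile(sorted_values, p):
    """Nearest-rank percentile of an already sorted list (0 < p <= 1)."""
    k = max(1, math.ceil(p * len(sorted_values)))
    return sorted_values[k - 1]


# Frequency table of the 2000 resample medians reported in Example 13.8.
freq = {165: 1, 166: 5, 166.5: 2, 167: 40, 167.5: 15, 168: 268, 168.5: 87,
        169: 739, 169.5: 111, 170: 491, 170.5: 44, 171: 188, 171.5: 5, 172: 4}
medians = sorted(value for value, count in freq.items() for _ in range(count))

boot_mean = mean(medians)
boot_var = variance(medians)   # bootstrap estimate of V(theta-hat)
ci_95 = (percentile(medians, 0.025), percentile(medians, 0.975))

print(f"R = {len(medians)}, mean = {boot_mean:.1f}, variance = {boot_var:.4f}")
print(f"95% percentile interval: [{ci_95[0]}, {ci_95[1]}]")

# Usage on raw data: medians of 50 resamples from a tiny toy sample.
toy_reps = bootstrap_replicates([1, 2, 3, 4, 5], median, R=50, seed=1)
```

With the actual sample, `bootstrap_replicates(heights, median, R=2000)` would produce a list like `medians`; the exact frequencies would differ from run to run with the random seed.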
For this distribution, a 95% CI for the median is [167.5, 171].

If the original SRS is without replacement, Gross (1980) proposes creating N/n copies of the sample to form a "pseudopopulation," then drawing R SRSs without replacement from the pseudopopulation. If n/N is small, the with-replacement and without-replacement bootstrap distributions should be similar.

Sitter (1992) describes and compares three bootstrap methods for complex surveys. In all these methods, bootstrapping is applied within each stratum. Here are steps for using one version of the rescaling bootstrap of Rao and Wu (1988) for a stratified random sample:

1. For each stratum, draw an SRS of size $(n_h - 1)$ with replacement from the sample in stratum h. Do this independently for each stratum.
2. For each resample r (r = 1, 2, . . . , R), create a new weight variable $w_i(r) = w_i \times \frac{n_h}{n_h - 1}\,m_i(r)$, where $m_i(r)$ is the number of times that observation i is selected to be in the resample. Calculate $\hat\theta^*_r$ using the weights $w_i(r)$.
3. Repeat steps 1 and 2 R times, for R a large number.
4. Calculate $\hat V_B(\hat\theta) = \frac{1}{R-1}\sum_{r=1}^{R} (\hat\theta^*_r - \hat\theta)^2$.

Advantages
The bootstrap will work for nonsmooth functions (such as quantiles) in general sampling designs. The bootstrap is well suited for finding confidence intervals directly: to get a 90% CI, merely take the 5th and 95th percentiles from $\hat\theta^*_1, \ldots, \hat\theta^*_R$, or use a bootstrap-t method such as that described in Efron (1982).

Disadvantages
The bootstrap requires more computations than BRR or the jackknife since R is typically a very large number. Compared with BRR and the jackknife, less theoretical work has been done on properties of the bootstrap in complex sampling designs.

13.4 Generalized Variance Functions

In many large government surveys such as the U.S. Current Population Survey (CPS) or the Canadian Labour Force Survey, hundreds or thousands of estimates are calculated and published.
The agencies analyzing the survey results could calculate standard errors for each published estimate and publish additional tables of the standard errors, but that would add greatly to the labor involved in publishing timely estimates from the surveys. In addition, other analysts of the public-use tapes may wish to calculate additional estimates, and the public-use tapes may not provide enough information to allow calculation of standard errors.

Generalized variance functions (GVFs) are provided in a number of surveys to calculate standard errors. They have been used for the CPS since 1947. Here, we describe some GVFs in the 1990 NCVS. Criminal Victimization in the United States, 1990 (U.S. Department of Justice 1992, 146) gives GVF formulas for calculating standard errors. If $\hat t$ is an estimated number of persons or households victimized by a particular type of crime or if $\hat t$ estimates a total number of victimization incidents,

$\hat V(\hat t) = a\,\hat t^2 + b\,\hat t$   (13.9)

If $\hat p$ is an estimated proportion,

$\hat V(\hat p) = \frac{b}{\hat z}\,\hat p\,(1 - \hat p)$   (13.10)

where $\hat z$ is the estimated base population for the proportion. For the 1990 NCVS, the values of a and b were a = -.00001833 and b = 3725. For example, it was estimated that 1.23% of persons aged 20 to 24 were robbed in 1990 and that 18,017,100 persons were in that age group. Thus, the GVF estimate of SE($\hat p$) is $\sqrt{3725\,(.0123)(1 - .0123)/18{,}017{,}100} = .0016$. Assuming that asymptotic results apply, this gives an approximate 95% CI of .0123 ± (1.96)(.0016), or [.0091, .0153].

There were an estimated 800,510 completed robberies in 1990. Using (13.9), the standard error of this estimate is $\sqrt{(-.00001833)(800{,}510)^2 + 3725\,(800{,}510)} = 54{,}499$.

Where do these formulas come from? Suppose $t_i$ is the total number of observation units belonging to a class, say, the total number of persons in the United States who were victims of violent crime in 1990. Let $p_i = t_i/N$, the proportion of persons in the population belonging to that class. If $d_i$ is the design effect (deff) in the survey for estimating $p_i$ (see Section 7.5), then

$V(\hat p_i) \approx d_i\,\frac{p_i(1 - p_i)}{n} = \frac{b_i}{N}\,p_i(1 - p_i)$   (13.11)

where $b_i = d_i \times (N/n)$. Similarly, $V(\hat t_i) \approx d_i N^2\,\frac{p_i(1 - p_i)}{n} = a_i t_i^2 + b_i t_i$, where $a_i = -d_i/n$.
If estimating a proportion in a domain (say, the proportion of persons in the 20-24 age group who were robbery victims), the denominator in (13.11) is changed to the estimated population size of the domain (see Section 3.3). If the deff's are similar for different estimates, so that $a_i \approx a$ and $b_i \approx b$, then constants a and b can be estimated that give (13.9) and (13.10) as approximations to the variance for a number of quantities.

The general procedure for constructing a generalized variance function is as follows:

1. Using replication or some other method, estimate variances for k population totals of special interest, $\hat t_1, \hat t_2, \ldots, \hat t_k$. Let $v_i = \hat V(\hat t_i)/\hat t_i^2$ be the relative variance for $\hat t_i$, for i = 1, 2, . . . , k.
2. Postulate a model relating $v_i$ to $\hat t_i$. Many surveys use the model $v_i = \alpha + \beta/\hat t_i$. This is a linear regression model with response variable $v_i$ and explanatory variable $1/\hat t_i$. Valliant (1987) found that this model produces consistent estimates of the variances for the class of superpopulation models he studied.
3. Use regression techniques to estimate $\alpha$ and $\beta$. Valliant (1987) suggests using weighted least squares to estimate the parameters, giving higher weight to items with small $v_i$.

The GVF estimate of variance, then, comes from the predicted value from the regression equation, $\hat v = a + b/\hat t$, so that $\hat V(\hat t) = a\,\hat t^2 + b\,\hat t$ as in (13.9). The $a_i$ and $b_i$ for individual items are replaced by quantities a and b, which are calculated from all k items. For the 1990 NCVS, b = 3725. Most weights in the 1990 NCVS are between 1500 and 2500; b approximately equals the (average weight) x (deff), if the overall design effect is about 2.

Valliant (1987) found that if deff's for the k estimated totals are similar, the GVF variances were often more stable than the direct estimate, as they smooth out some of the fluctuations from item to item. If a quantity of interest does not follow the model in step 2, however, the GVF estimate of the variance is likely to be poor, and you can only know that it is poor by calculating the variance directly.
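The GVF calculations for the 1990 NCVS, and step 3 of the procedure above, can be sketched as follows. The functional forms V(t) = a t^2 + b t and V(p) = b p (1 - p) / base are assumed here for (13.9) and (13.10); they reproduce the standard error of .0016 quoted for the robbery proportion. The fitting function uses ordinary (unweighted) least squares for brevity, whereas Valliant (1987) suggests weighted least squares, and the totals used to exercise the fit are synthetic values generated from the model itself, not survey data. All function names are hypothetical.

```python
import math

A_NCVS_1990 = -0.00001833  # GVF parameter a quoted for the 1990 NCVS
B_NCVS_1990 = 3725.0       # GVF parameter b quoted for the 1990 NCVS


def gvf_se_total(t_hat, a, b):
    """Standard error of an estimated total under V(t) = a*t^2 + b*t."""
    return math.sqrt(a * t_hat ** 2 + b * t_hat)


def gvf_se_proportion(p_hat, base, b):
    """Standard error of a proportion under V(p) = b*p*(1-p)/base."""
    return math.sqrt(b * p_hat * (1.0 - p_hat) / base)


def fit_gvf(t_hats, rel_vars):
    """Least-squares fit of rel_var = a + b / t_hat; returns (a, b)."""
    xs = [1.0 / t for t in t_hats]
    x_mean = sum(xs) / len(xs)
    v_mean = sum(rel_vars) / len(rel_vars)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxv = sum((x - x_mean) * (v - v_mean) for x, v in zip(xs, rel_vars))
    b = sxv / sxx
    return v_mean - b * x_mean, b


# Proportion of persons aged 20 to 24 who were robbed in 1990.
se_p = gvf_se_proportion(0.0123, 18_017_100, B_NCVS_1990)
ci_p = (0.0123 - 1.96 * se_p, 0.0123 + 1.96 * se_p)

# Estimated number of completed robberies in 1990.
se_t = gvf_se_total(800_510, A_NCVS_1990, B_NCVS_1990)

# Synthetic check of the fit: relative variances generated exactly from the model.
synthetic_totals = [5e4, 2e5, 8e5, 3e6]
synthetic_relvars = [A_NCVS_1990 + B_NCVS_1990 / t for t in synthetic_totals]
a_fit, b_fit = fit_gvf(synthetic_totals, synthetic_relvars)

print(f"SE(p) = {se_p:.4f}, 95% CI = [{ci_p[0]:.4f}, {ci_p[1]:.4f}]")
print(f"SE(t) = {se_t:,.0f}")
print(f"fitted a = {a_fit:.8f}, b = {b_fit:.1f}")
```

In practice the inputs to `fit_gvf` would be directly estimated relative variances for the k totals of interest, and the fitted a and b would then be published for use with the two standard error functions.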
Advantages
The GVF may be used when insufficient information is provided on the public-use tapes to allow direct calculation of standard errors. The data collector can calculate the GVF, and the data collector often has more information for estimating variances than is released to the public. A generalized variance function saves a great deal of time and speeds production of annual reports. It is also useful for designing similar surveys in the future.

Disadvantages
The model relating $v_i$ to $\hat t_i$ may not be appropriate for the quantity you are interested in, resulting in an unreliable estimate of the variance. You must be careful about using GVFs for estimates not included when calculating the regression parameters. If a subpopulation has an unusually high degree of clustering (and hence a high deff), the GVF estimate of the variance may be much too small.

13.5 Confidence Intervals

13.5.1. Confidence Intervals for Smooth Functions of Population Totals

Theoretical results exist for most of the variance estimation methods discussed in this chapter, stating that under certain assumptions $(\hat\theta - \theta)/\sqrt{\hat V(\hat\theta)}$ asymptotically follows a standard normal distribution. These results and conditions are given in Binder (1983) for linearization estimates; in Krewski and Rao (1981) and Rao and Wu (1985) for the jackknife and BRR; and in Rao and Wu (1988) and Sitter (1992) for the bootstrap. Consequently, when the assumptions are met, an approximate 95% confidence interval for $\theta$ may be constructed as $\hat\theta \pm 1.96\sqrt{\hat V(\hat\theta)}$. Alternatively, a $t_{df}$ percentile may be substituted for 1.96, with df = (number of groups - 1) for the random group method. Rust and Rao (1996) give guidelines for appropriate df's for other methods.

Roughly speaking, the assumptions for linearization, jackknife, BRR, and bootstrap are as follows:

1. The quantity of interest $\theta$ can be expressed as a smooth function of the population totals; more precisely, $\theta = h(t_1, t_2, \ldots, t_k)$, where the second-order partial derivatives of h are continuous.
2. The sample sizes are large: either the number of PSUs sampled in each stratum is large, or the survey contains a large number of strata. (See Rao and Wu 1985 for the precise technical conditions needed.) Also, to construct a confidence interval using the normal distribution, the sample sizes must be large enough so that the sampling distribution of $\hat\theta$ is approximately normal.

Furthermore, a number of simulation studies indicate that these confidence intervals behave well in practice. Wolter (1985) summarizes some of the simulation studies; others are found in Kovar et al. (1988) and Rao et al. (1992). These studies indicate that the jackknife and linearization methods tend to give similar estimates of the variance, while the bootstrap and BRR procedures give slightly larger estimates. Sometimes a transformation may be used so that the sampling distribution of a statistic is closer to a normal distribution: if estimating total income, for example, a log transformation may be used because the distribution of income is extremely skewed.

13.5.2. Confidence Intervals for Population Quantiles

The theoretical results described above for BRR, jackknife, bootstrap, and linearization do not apply to population quantiles, however, because they are not smooth functions of population totals. Special methods have been developed to construct confidence intervals for quantiles; McCarthy (1993) compares several confidence intervals for the median, and his discussion applies to other quantiles as well.

Let q be between 0 and 1. Then define the quantile $\theta_q$ as $\theta_q = F^{-1}(q)$, where $F^{-1}(q)$ is defined to be the smallest value y satisfying $F(y) \ge q$. Similarly, define $\hat\theta_q = \hat F^{-1}(q)$. Now $F^{-1}$ and $\hat F^{-1}$ are not smooth functions, but we assume the population and sample are large enough so that they can be well approximated by continuous functions.

Some of the methods already discussed work quite well for constructing confidence intervals for quantiles.
The random group method works well if the number of random groups, R, is moderate. Let $\hat\theta_q(r)$ be the estimated quantile from random group r. Then, a confidence interval for $\theta_q$ is $\hat\theta_q \pm t\,\sqrt{\frac{1}{R(R-1)}\sum_{r=1}^{R} [\hat\theta_q(r) - \hat\theta_q]^2}$, where t is the appropriate percentile from a t distribution with R - 1 df. Similarly, empirical studies by McCarthy (1993), Kovar et al. (1988), Sitter (1992), and Rao et al. (1992) indicate that in certain designs confidence intervals can be formed using $\hat\theta_q \pm 1.96\sqrt{\hat V(\hat\theta_q)}$, where the variance estimate is calculated using BRR or bootstrap.

An alternative interval can be constructed based on a method introduced by Woodruff (1952). For any y, $\hat F(y)$ is a function of population totals: $\hat F(y) = \sum_i w_i u_i / \sum_i w_i$, where $u_i = 1$ if $y_i \le y$ and $u_i = 0$ if $y_i > y$. Thus, a method in this chapter can be used to estimate $V[\hat F(y)]$ for any value y, and an approximate 95% CI for F(y) is given by $\hat F(y) \pm 1.96\sqrt{\hat V[\hat F(y)]}$.

Now let's use the confidence interval for $q = F(\theta_q)$ to obtain an approximate confidence interval for $\theta_q$. Since we have a 95% CI,

$0.95 \approx P\{\hat F(\theta_q) - 1.96\sqrt{\hat V[\hat F(\theta_q)]} \le q \le \hat F(\theta_q) + 1.96\sqrt{\hat V[\hat F(\theta_q)]}\}$
$= P\{q - 1.96\sqrt{\hat V[\hat F(\theta_q)]} \le \hat F(\theta_q) \le q + 1.96\sqrt{\hat V[\hat F(\theta_q)]}\}$
$= P\{\hat F^{-1}(q - 1.96\sqrt{\hat V[\hat F(\theta_q)]}) \le \theta_q \le \hat F^{-1}(q + 1.96\sqrt{\hat V[\hat F(\theta_q)]})\}$.

So an approximate 95% CI for the quantile $\theta_q$ is

$[\hat F^{-1}\{q - 1.96\sqrt{\hat V[\hat F(\hat\theta_q)]}\},\ \hat F^{-1}\{q + 1.96\sqrt{\hat V[\hat F(\hat\theta_q)]}\}]$.

The derivation of this confidence interval is illustrated in Figure 13.2.

Figure 13.2: Woodruff's confidence interval for the quantile $\theta_q$ if the empirical distribution function is continuous. Since F(y) is a proportion, we can easily calculate a confidence interval (CI) for any value of y, shown on the vertical axis. We then look at the corresponding points on the horizontal axis to form a confidence interval for $\theta_q$.

Now we need several technical assumptions to use the Woodruff-method interval. These assumptions are stated by Rao and Wu (1987) and Francisco and Fuller (1991), who studied a similar confidence interval.
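The Woodruff construction can be sketched for the simplest case, an SRS with the fpc ignored, where the variance of the estimated distribution function at the quantile is approximated by q(1 - q)/n. The sketch below is an illustration under stated assumptions: it uses the tabulated values of the empirical distribution function for the height sample given in Example 13.9, takes n = 200 as the sample size, and smooths the step function by linear interpolation as that example does. The function names are hypothetical.

```python
import math


def interpolate_inverse_cdf(q, ys, Fs):
    """Linearly interpolate the tabulated points (y, F-hat(y)) to solve F-hat(y) = q."""
    points = list(zip(ys, Fs))
    for (y0, F0), (y1, F1) in zip(points, points[1:]):
        if F0 <= q <= F1:
            return y0 + (q - F0) / (F1 - F0) * (y1 - y0)
    raise ValueError("q lies outside the tabulated range of F-hat")


def woodruff_interval_srs(q, n, ys, Fs, z=1.96):
    """Woodruff CI for the q-th quantile, SRS with fpc ignored: V[F-hat] ~ q(1-q)/n."""
    half_width = z * math.sqrt(q * (1.0 - q) / n)
    lower = interpolate_inverse_cdf(q - half_width, ys, Fs)
    upper = interpolate_inverse_cdf(q + half_width, ys, Fs)
    return lower, upper


# Tabulated empirical distribution function for heights (cm), Example 13.9.
heights = [167, 168, 170, 171, 172]
F_hat = [0.405, 0.440, 0.515, 0.550, 0.605]

lower, upper = woodruff_interval_srs(q=0.5, n=200, ys=heights, Fs=F_hat)
print(f"Approximate 95% CI for the median: [{lower:.1f}, {upper:.1f}]")
```

For a complex design, the only change would be to replace q(1 - q)/n with a design-based estimate of the variance of the estimated distribution function, obtained by any of the methods in this chapter.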
Basically, the problem is that both $F$ and $\hat{F}$ are step functions; they have jumps at the values of y in the population and sample. The technical conditions say that the jumps in $F$ and in $\hat{F}$ should be small and that the sampling distribution of $\hat{F}(y)$ is approximately normal.

Example 13.9 Let’s use Woodruff’s method to construct a 95% CI for the median height in the file ht.srs, discussed in Examples 7.3 and 13.8. Note that $\hat{F}(\theta_q)$ is the sample proportion of observations in the SRS that take on value at most $\theta_q$; so, ignoring the fpc,

$$\hat{V}[\hat{F}(\hat{\theta}_q)] = \frac{q(1-q)}{n}.$$

Thus, for this sample of n = 200 heights with q = 0.5, $1.96\sqrt{(0.5)(0.5)/200} = 0.0693$. The lower confidence bound for the median is then $\hat{F}^{-1}(0.5 - 0.0693) = \hat{F}^{-1}(0.4307)$, and the upper confidence bound for the median is $\hat{F}^{-1}(0.5 + 0.0693) = \hat{F}^{-1}(0.5693)$. As heights were only measured to the nearest centimeter, we’ll use linear interpolation to smooth the step function for the empirical distribution function. The following values were obtained:

y        167     168     170     171     172
F̂(y)     0.405   0.440   0.515   0.550   0.605

Then, interpolating,

$$\hat{F}^{-1}(0.4307) = 167 + \frac{0.4307 - 0.405}{0.440 - 0.405}\,(168 - 167) = 167.7$$

and

$$\hat{F}^{-1}(0.5693) = 171 + \frac{0.5693 - 0.550}{0.605 - 0.550}\,(172 - 171) = 171.4.$$

Thus, an approximate 95% CI for the median is [167.7, 171.4].

13.5.3. Conditional Confidence Intervals

The confidence intervals presented so far in this chapter have been developed under the design-based approach. A 95% CI may be interpreted in the repeated-sampling sense that, if samples were repeatedly taken from the finite population, we would expect 95% of the resulting confidence intervals to include the true value of the quantity in the population. Sometimes, especially in situations when ratio estimation or poststratification is used, you may want to consider constructing a conditional confidence interval instead. In poststratification as used for nonresponse (Section 8.5.2), a variance that conditions on the respondent sample sizes $n_{hR}$ was presented.
A 95% conditional confidence interval, constructed using the variance in (8.3), would have the interpretation that we would expect 95% of all samples having those specific values of $n_{hR}$ to yield confidence intervals containing the true population value. The theory of conditional confidence intervals is beyond the scope of this book; we refer the reader to Särndal et al. (1992, sec. 7.10), Casady and Valliant (1993), and Thompson (1997, sec. 5.12) for more discussion and bibliography.

13.6 Summary and Software

This chapter has briefly introduced you to some basic types of variance estimation methods that are used in practice: linearization, random groups, replication, and generalized variance functions. But this is just an introduction; you are encouraged to read some of the references mentioned in this chapter before applying these methods to your own complex survey. Much of the research exploring the properties and behavior of these methods has been done since 1980, and variance estimation methods are still a subject of research by statisticians.

Linearization methods are perhaps the most thoroughly researched in terms of theoretical properties and have been widely used to find variance estimates in complex surveys. The main drawback of linearization, though, is that the derivatives need to be calculated for each statistic of interest, and this complicates the programs for estimating variances. If the statistic you are interested in is not handled in the software, you must write your own code.

The random group method is an intuitively appealing method for estimating variances. Easy to explain and to compute, it can be used for almost any statistic of interest. Its main drawback is that we generally need enough random groups to have a stable estimate of the variance, and the number of random groups we can form is limited by the number of PSUs sampled in a stratum.
Resampling methods for stratified multistage surveys avoid partial derivatives by computing estimates for subsamples of the complete sample. They must be constructed carefully, however, so that the correlation of observations in the same cluster is preserved in the resampling. Resampling methods require more computing time than linearization but less programming time: the same method is used on all statistics. They have been shown to be equivalent to linearization for large samples when the characteristic of interest is a smooth function of population totals. The BRR method can be used with almost any statistic, but it is usually used only for two-PSU-per-stratum designs or for designs that can be reformulated into two PSUs per stratum. The jackknife and bootstrap can also be used for most estimators likely to be used in surveys (exception: the delete-1 jackknife may not work well for estimating the variance of quantiles) and may be used in stratified multistage samples in which more than two PSUs are selected in each stratum, but they require more computing than BRR.

Generalized variance functions are cheap and easy to use but have one major drawback: unless you can calculate the variance using one of the other methods, you cannot be sure that your statistic follows the model used to develop the GVF.

All methods except GVFs assume that information on the clustering is available to the data analyst. In many surveys, such information is not released because it might lead to identification of the respondents. See Dippo et al. (1984) for a discussion of this problem.

Various software packages have been developed to assist in analyzing data from complex surveys. Cohen (1997), Lepkowski and Bowles (1996), and Carlson et al. (1993) evaluate PC-based packages for analysis of complex survey data.[1] SUDAAN (Shah et al. 1995), OSIRIS (Lepkowski 1982), Stata (StataCorp 1996), and PC-CARP (Fuller et al. 1989) all use linearization methods to estimate variances of nonlinear statistics. SUDAAN, for example, calculates variances of estimated population totals for various stratified multistage sampling designs that have H strata, unequal-probability cluster sampling with or without replacement at the first stage of sampling, and SRS with or without replacement at subsequent stages. The formula in (6.9) is used to estimate the variance for each stratum in with-replacement sampling, and the Sen-Yates-Grundy form in (6.15) is used for without-replacement variance. Then, the variances for the totals in the strata are added to estimate the variance for the estimated population total. SUDAAN then uses linearization to find variances for ratios, regression coefficients, and other nonlinear statistics. Recent versions of SUDAAN also implement BRR and jackknife. OSIRIS also implements BRR and jackknife methods. The survey software packages WesVarPC (Brick et al. 1996; at press time, WesVarPC could be downloaded free from www.westat.com) and VPLX (Fay 1990) both use resampling methods to calculate variance estimates. A simple S-PLUS function for jackknife is given in Appendix D; this is not intended to substitute for well-tested commercial software but to give you an idea of how these calculations might be done. Then, after you understand the principles of the methods, you can use commercial software for your complex surveys.

1. Lepkowski and Bowles (1996) tell how to access the free (or almost-free) software packages CENVAR, CLUSTERS, Epi Info, VPLX, and WesVarPC through e-mail or from the internet. Software for analysis of survey data is changing rapidly; the Survey Research Methods Section of the American Statistical Association (www.amstat.org) is a good resource for updated information.
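To give a rough idea of how a replication calculation of this kind might look, here is a minimal Python sketch of a delete-one-PSU jackknife for a stratified sample. It is not the S-PLUS function from Appendix D. The record layout `(stratum, psu, weight, y)`, the function names `weighted_mean` and `jackknife_variance`, and the choice of the common delete-one formula (replicate weights in the affected stratum rescaled by n_h/(n_h - 1), squared deviations multiplied by (n_h - 1)/n_h) are all assumptions made for this illustration.

```python
from collections import defaultdict


def weighted_mean(records):
    """Weighted mean of y; each record is (stratum, psu, weight, y)."""
    total_weight = sum(w for _, _, w, _ in records)
    return sum(w * y for _, _, w, y in records) / total_weight


def jackknife_variance(records, estimator=weighted_mean):
    """Delete-one-PSU jackknife variance for a stratified sample (sketch).

    Returns (full-sample estimate, jackknife variance estimate).
    """
    # Collect the PSU identifiers that appear in each stratum.
    psus_by_stratum = defaultdict(set)
    for stratum, psu, _, _ in records:
        psus_by_stratum[stratum].add(psu)

    theta_full = estimator(records)
    variance = 0.0
    for stratum, psus in psus_by_stratum.items():
        n_h = len(psus)
        if n_h < 2:
            raise ValueError("each stratum needs at least two PSUs")
        for dropped in psus:
            replicate = []
            for h, j, w, y in records:
                if h == stratum and j == dropped:
                    continue  # omit the deleted PSU
                if h == stratum:
                    w = w * n_h / (n_h - 1)  # rescale the remaining PSUs
                replicate.append((h, j, w, y))
            deviation = estimator(replicate) - theta_full
            variance += (n_h - 1) / n_h * deviation ** 2
    return theta_full, variance


# Hypothetical data: one stratum, five single-element PSUs, equal weights.
sample = [("A", i, 1.0, float(y)) for i, y in enumerate([1, 2, 3, 4, 5])]
estimate, var_hat = jackknife_variance(sample)
print(estimate, round(var_hat, 6))
```

For this small equal-weight example the jackknife variance of the mean agrees with the familiar s²/n (2.5/5 = 0.5), which is a convenient sanity check. Any other estimator that accepts the same records can be passed through the `estimator` argument, which reflects the point made above that the same resampling method is used on all statistics.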
CHAPTER 14 SAMPLING FOR OBJECTIVE MEASUREMENT SURVEYS IN AGRICULTURE

14.1 NEED FOR OBJECTIVE MEASUREMENTS

The principles of sampling discussed in the previous lectures are widely applicable to survey programs generally. Certain kinds of surveys, however, may require special techniques of sampling and data collection which are determined by the nature of the inquiry or the ability of respondents to give accurate answers. Chapter 12 describes some special techniques used in agriculture surveys. Statistics on area planted with individual crops and on yields from these crops are, in most countries, based upon periodic reports from crop reporters. In some countries, these reporters are holders or other individuals who reside in the rural areas and have knowledge of the local agriculture; they report voluntarily, usually by mail. In other countries the reporters are government officials or agents. The reports submitted by these agents are usually less accurate than those submitted by private individuals, in part because the agents are usually reporting for a much larger area and in part because the agents are not so closely connected with agriculture. However, whether made by private individuals or by government agents, these reports are all subject to biases which are often large and always difficult to evaluate. For example, investigations in various countries have shown that in estimating yields, reporters (particularly official reporters) have a tendency to be biased toward the normal; in other words, in good years they tend to underestimate the yield whereas in bad years they tend to overestimate. Although private reporters also have this tendency to some extent, they are generally more inclined to underestimate in the belief that it will be to their advantage to do so.
Areas, on the other hand, tend to be overestimated because of the difficulty of making proper allowances for non planted areas around the edges of fields and areas within the fields that cannot be planted. Check data from past years can be used to evaluate the biases in the estimates of production obtained from reporters. For crops such as tobacco or cotton, which must be processed before being used, information on production can be obtained from the processors and compared with the corresponding figures obtained from reporters. For other crops, similar use can be made of data obtained from marketing or shipping sources. If such data are complete (usually there is no guarantee that they are complete) and if the relative bias remains reasonably constant from year to year, estimates for the current year can be adjusted on the basis of this past experience. For other crops, which are at least partly consumed locally, fed to livestock, etc., such check data are not available. Census data, if available, can be used as a benchmark for adjusting the reports for these crops. However, the census data are also subject to reporting biases. Furthermore, adjustments using census data become less and less reliable as the time lapse between the last census and the current year widens. Experience in many different countries under a variety of conditions has indicated that subjective methods of estimating production, even when other data are available for adjusting the estimates, cannot provide reliable results. If accurate and unbiased estimates are required, the only alternative is to establish some type of program utilizing objective methods of observation applied on a random sampling basis. Such surveys are called "objective measurement surveys" because the data are collected by actual observation and measurement or counting, rather than by methods depending on the judgment, good memory, or education of persons who report the required information. 
Even though such a program of objective measurement surveys is relatively costly and difficult to carry out, the results will usually justify the effort. 14.2 DESIGNING THE SAMPLE The theoretical considerations affecting sample design, discussed in previous lectures, are as relevant to the design of an objective measurement survey as they are to any other survey. 14.2.1 Type of estimates required The sampling statistician must know whether estimates are required for the nation as a whole, for the Provinces or districts individually, or for some other administrative areas. The sample allocation must be planned to give estimates for the desired areas at an acceptable level of reliability. If an estimate of the number of holdings (either in total or for a specific crop) is also required, this must be considered in designing the sample. 14.2.2 Stratification First-level strata often consist of the smallest areas requiring separate estimates. Further gains in efficiency may be obtained by further stratification into geographic areas having relatively homogeneous yield rates for the crop. Other bases for stratification, such as irrigated and nonirrigated land, varieties of crops, etc., may also be used. 14.2.3 Allocation to strata The statistician must decide how to allocate the sample to strata. A common practice is to allocate it proportionately to the area under the particular crop or group of crops being investigated. If available, knowledge about the relative variances and/or the relative costs of performing the field work in the different strata should also be used in allocating the sample. 14.2.4 Sampling within strata A decision must be made on the method of sampling within strata. As was indicated before, there are usually several possible sampling units and sample designs. In deciding upon a sampling plan, the sampling statistician will need to know what materials are available for constructing the sampling frame and what types of data are required. 
His choice may also be influenced by other factors such as the availability of capable personnel to carry out the work. However, even with the restrictions imposed by these considerations, there will usually be a number of possible choices.

14.2.4.1 Sample stages and types of sampling units

In most practical applications, several sampling stages and sampling units will be used within strata. For example, if the strata are large administrative divisions, such as Provinces, a sample of districts might be selected at the first stage and a sample of subdistricts within sample districts at the second stage. Where "villages" have identifiable boundaries and account for all the land, they can serve as convenient units at some stage in the sampling. The ultimate unit of analysis will usually be an individual holding, the individual field, or (for studies involving estimation of yields) small plots within fields. If the field is the unit of analysis, holdings may be selected at the preceding stage.

14.2.4.2 Methods of selecting holdings and fields

The following examples illustrate some procedures that can be used to select holdings and fields in the final stages of the sample design. The selection of plots within fields is discussed in section 14.4.4 of this chapter.

(1) Holdings can be selected from lists if lists are available or can be constructed without much difficulty. Lists of holdings would be needed only for the units (villages, subdistricts, etc.) actually selected in the sample at the preceding stage; if necessary, these could be compiled as part of the field operation. The selection of holdings can be made either with equal probability or with probability proportionate to size (assuming that information on size is available or can be obtained). The measure of size might be total reported area in the holding, total area in a particular crop or group of crops, etc.
Similarly, within each selected holding, a list of fields could be compiled and a sample selected. Again, selection could be made either with equal probability or with probability proportionate to size. (2) If maps or aerial photographs are available, these can be used to select fields directly without first selecting holdings. One way to do this is to superimpose on the map or photo a grid on which dots have been placed either in a systematic pattern or at random; each field into which a dot falls is then included in the sample, thus giving the fields probabilities of selection proportionate to their sizes. This procedure requires, of course, that the maps or photos be sufficiently detailed so that the point and the corresponding field can be located on the ground. (This procedure is not easily adaptable to estimating number of holdings, if that is desired.) (3) Area segments are useful sampling units for determining which holdings and/or fields are to be included in the sample. These segments may be constructed either with natural boundaries that can be located on the ground or with imaginary boundaries drawn on a photo or map; the choice depends upon the particular situation. Holdings and/or fields may be associated with area segments in any of the following ways: (a) Area segments with imaginary boundaries could be used as first-stage sampling units and a sample of segments selected; within the sample segments, fields could be selected as second-stage units in the manner described above in (2). (b) An alternative procedure would be to include in the sample all fields (or holdings) for which a uniquely defined point falls within the segment boundaries. With this procedure, fields (or holdings) would not be selected with probability proportionate to their sizes; the probability of selection would be the same as the probability of selection of the segment into which the point falls. This is known as an open segment approach. 
The segments determine which units are included in the sample, but data are tabulated for some fields (or holdings) lying partly outside the segment and are not tabulated for other fields (or holdings) lying partly inside the segment. The unique point must be defined with care. Usually a particular corner of the field (holding) would be designated as the unique point. Because fields (holdings) may not be rectangular, a specific rule for locating this corner would be needed as well. For example, if the northwest corner were the designated unique point, it could be defined either (1) by identifying the boundary points that lie farthest west and then designating the most northern of these points as the northwest corner or (2) by identifying the boundary points that lie farthest north and then designating the most western of these points as the northwest corner. If the holding were the unit of analysis, the residence of the holder (provided all such residences had a chance of being included in the sample) would generally be preferred as the unique point since it would be the easiest point to locate. A combination of rules is, perhaps, even more useful. For example, the residence of the holder might be used when the holder lives on the holding, and a particular corner used when he does not live on the holding. In any case, the point must be defined in a way such that it is truly unique (that is, each unit must have one, and only one, such point associated with it and thus have one, and only one, chance of being included in the sample); it should also be fairly easy to identify.

(c) If the unit of analysis is the holding, the weighted segment approach will usually be more efficient than the open segment approach. With this procedure, all holdings having any land in the segment are included in the sample. In the estimation, the data from each holding are weighted by a factor based on the proportion of the entire holding lying inside the segments.
In almost all applications, the weighted segment approach requires that the segments have natural boundaries that can be identified on the ground.

(d) Still another possibility is to use the so-called closed-segment approach in which only those fields or parts of fields lying within the segments are included in the sample. One advantage of this procedure is that it avoids the difficulty of having to define the holding. Of course, if information is desired on a holding basis, the closed-segment approach is not appropriate since some holdings will certainly extend beyond the segment boundaries.

14.3 OBJECTIVE MEASUREMENT PROCEDURES FOR THE ESTIMATION OF AREA

Since it is known that data on land area obtained by asking individuals to respond to questionnaires can be very inaccurate, other means of obtaining these data have been investigated.[1] The usual approach in objective measurement surveys is to select a sample of areas, and then to go to these areas and measure them directly. There are also methods of obtaining objective estimates of area that do not require direct measurement of the land; for example, measuring the area on aerial photographs. In addition to the measurements, other information may be obtained. For example, the land may be classified into various categories according to its use (crop land, pasture, wasteland, etc.), the particular crop being grown on each piece of land may be identified, etc.

1. For discussion of techniques and experiences in many countries, see S. S. Zarkovich (ed.), Estimation of Areas in Agricultural Statistics, Food and Agriculture Organization of the United Nations, Rome, 1965.

14.3.1 Measurement of land area

The first step in making direct measurements of land is to make a scale drawing. In order to do this, one must be able to measure distances and angles. A drawing made by a professional land surveyor using technical equipment would be very precise.
On the other hand, a drawing made by an inexperienced worker measuring distances by pacing and measuring angles by eye estimates would not be very accurate. Between these extremes, there are many other methods that can be used. One should balance the relative cost against the relative accuracy of the various procedures and select the method that will provide an acceptable level of reliability for the lowest cost. After the scale drawing has been made, the area of the drawing must be determined. If the land that was measured is in the shape of a regular geometric figure such as a rectangle, trapezoid, etc., it is relatively easy to determine the area of the drawing by standard mathematical formulas. Using the appropriate expansion factor, the area of the land represented by the drawing can then be determined. Often, however, the area is of irregular shape and other methods must be used; for example, triangulation, planimetering, gridding, dot counting, and map cutting and weighing. 14.3.1.1 Triangulation.--In triangulation, the polygon formed by the drawing is converted into simple triangles. It is a principle of geometry that this can always be done. (Curved boundaries are roughly approximated by a series of straight lines before triangulation.) Each triangle is measured and the area computed by standard formulas. This procedure is time consuming and tedious and has largely been replaced. 14.3.1.2 Planimetering.--A planimeter is an instrument with which one can determine the area of a closed figure by tracing around the boundary of the figure with a pencil-like device. A good planimeter will give very accurate results. It does, however, require a skilled operator and much time. 14.3.1.3 Gridding.--Basically, a grid is a plane divided into small squares (for example, a piece of ordinary graph paper). For use in measuring area, the squares are constructed so that each is equivalent to a particular amount of area in accordance with the scale of the drawing. 
A transparent plastic grid can be placed over the drawing; or the grid can be printed on paper and the drawing made directly on this paper. To estimate the area represented by the drawing, one counts the whole squares and parts of squares within the perimeter of the scale drawing and converts this number to its equivalent in terms of the appropriate unit of area.

Figure 1: MEASUREMENT BY GRIDDING (1 SQUARE = 1/4 HECTARE)

Although not as accurate as planimetering, gridding can be done in less time. It requires only that the individual be able to count accurately and that he be able to accurately convert the partial squares into an equivalent number of whole squares. See Figure 1 for an illustration of this method. There are approximately 159 squares within the scale drawing (including the partial squares that overlap the boundary); thus, since each square represents 1/4 hectare, the field contains about 40 hectares.

14.3.1.4 Dot counting.--Dot counting is essentially the same as gridding except that instead of small squares, the grid consists of uniformly spaced dots. Each dot represents a unit area according to the scale of the drawing. One need only count the dots lying within the perimeter of the drawing to find the area. If any dots lie on the boundary, only half of them are counted.

14.3.1.5 Map cutting and weighing.--By this procedure, the map or photograph of the area is carefully cut into pieces representing different categories of land along the lines drawn by the field worker. Each piece is then carefully weighed. The estimation is based on the weight of the paper in each category relative to the weight for the entire area. This procedure is not very practical; it is time consuming and requires a weighing instrument of high precision and map paper of uniform quality.

14.3.2 Observation of land uses for a sample of points or lines

Some methods of objectively measuring area do not require direct measurement of the land itself.
Instead, the proportion of land falling into various categories is estimated by some objective means and multiplied by the known total area of land in the universe (Province, district, etc.) to estimate the total area in each category. All of the methods discussed in section 14.3.2, except the last one (the road-based method described at the end of paragraph 14.3.2.2), require accurate, up-to-date maps or aerial photographs; consequently, their usefulness is somewhat limited at this time. However, as progress is made in aerial photography, these and similar methods are likely to become more generally useful in the future.

14.3.2.1 Observations for a sample of points.--A sample of points is selected and the points marked on maps or aerial photographs. In selecting the sample of points, appropriate techniques of stratification and clustering should be used to maximize the efficiency of the design. For example, if primary interest is in the estimation of crop areas, higher sampling rates should be used in those portions of the universe known to consist primarily of crop land. If only broad categories of land use are to be estimated, and suitable aerial photographs are available, it may be possible to make the necessary observations directly from the photographs. For most purposes, however, it will be necessary to send observers to the field to locate each sample point and to record the crop being grown or other use being made of the land at the point. One author has suggested that for periodic surveys the sample points be permanently identified by suitable markers, to make them easier to locate. The markers could not be placed at the exact locations of the sample points, since they would interfere with farming operations; however, they would be placed nearby and equipped with sighting devices aimed at the sample points. This method has not yet been tried in the field. (Refer to "Fixed-Point Sampling--A New Method of Estimating Crop Areas" by Thomas B. Jabine in Estadistica, published by the Inter-American Statistical Institute, Washington, D.C., September-December 1967.) Once the observations have been made for the sample of points, one can make an unbiased estimate of the area devoted to a particular use:

(1) For each stratum in which points were sampled at a constant rate, tally the number of sample points in each land use category.
(2) Multiply the known total area of the stratum by the proportion of sample points devoted to that use.
(3) Sum over all strata.

14.3.2.2 Observations for a sample of lines.--A sample of lines is selected and the lines are marked on maps or aerial photographs. As in the case of points, appropriate techniques of stratification and clustering should be used to increase the efficiency of the design. The usual procedure within ultimate sampling units is to select a sample of parallel lines spaced at equal intervals. By using aerial photographs, or by actually pacing the lines, the investigator determines the proportion of each line falling into each land use category. Unbiased estimates are then made from these observations by a procedure completely analogous to that described above for point samples. A relatively cheap but biased form of line sampling involves the substitution of roads for a probability sample of lines. The investigator drives a car along a prescribed route. The car is equipped with a distance measuring device. As he drives, the investigator notes and records the distance for which the road is bordered by each category of land being measured (specific crops, crop land in general, pasture, woodland, etc.). Estimates are then made in the normal way for line sampling. This last technique is likely to be seriously biased, especially in areas where the road network is sparse, since the pattern of land use along roads is likely to differ substantially from the overall pattern for a given area.
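The three-step point-sample estimate of section 14.3.2.1 (tally by stratum, multiply the stratum area by the sample proportion, sum over strata) can be sketched in a few lines of Python. The function name `estimate_land_use_areas`, the dictionary layout, and the stratum areas and point observations below are hypothetical values chosen only to illustrate the arithmetic; the sketch assumes a constant sampling rate of points within each stratum, as the procedure requires.

```python
from collections import Counter


def estimate_land_use_areas(strata):
    """Estimate total area by land use from stratified point samples.

    Each stratum is a dict with a known total 'area' and a list of
    'points', one observed land-use label per sample point.
    """
    totals = Counter()
    for stratum in strata:
        points = stratum["points"]
        if not points:
            raise ValueError("each stratum needs at least one sample point")
        # Step 1: tally the sample points in each land-use category.
        tally = Counter(points)
        for use, count in tally.items():
            # Step 2: stratum area times the proportion of points in this use.
            # Step 3: accumulate the stratum estimates over all strata.
            totals[use] += stratum["area"] * count / len(points)
    return dict(totals)


# Hypothetical strata (areas in hectares).
strata = [
    {"area": 1000.0, "points": ["crop"] * 6 + ["pasture"] * 4},
    {"area": 500.0, "points": ["crop", "pasture", "pasture", "pasture", "waste"]},
]
print(estimate_land_use_areas(strata))
```

With these illustrative inputs the first stratum contributes 600 ha of crop land and 400 ha of pasture, and the second contributes 100 ha, 300 ha, and 100 ha of crop land, pasture, and waste. The same function applies to line samples if each line's contribution is first converted to proportions by length rather than by counts.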
Techniques based on probability sampling should be used in preference if at all possible.

14.3.3 Use of ratio estimation and double sampling to improve efficiency

Having completed area measurements on the holdings (or other units of analysis) in the sample, we can estimate totals directly from these data by the estimation procedure which is appropriate to the particular sample design. This procedure can usually be improved upon, however, if in addition to making area measurements for a sample of the population, we also have available less accurate and less expensive area data (for example, data obtained by direct interview) from the entire population. Such data would normally come from a complete census. By means of ratio estimation, we can often obtain estimates of population totals that will be more reliable than those that could be obtained from either the objective measurements or the interview responses alone. The procedure is essentially the same as that discussed in section 2.3 of chapter 10. The X-characteristic in this case would be the actual measurement of the land obtained for a subset of the population; the Y-characteristic would be the data collected by the interview. Even more useful and practical is a technique called double sampling[2] in which the less expensive technique is used to obtain data from a relatively large sample of the population and the more expensive technique to obtain data from a subsample of the basic sample. Again, ratio estimation is used, but here the Y-characteristic is the response that is obtained by the less expensive technique, and the sample estimate of the population total for the Y-characteristic is used in place of a total based on 100-percent coverage.

2. Double sampling is a statistical technique useful in a variety of situations whenever a characteristic of interest that is difficult or expensive to determine is correlated highly with another characteristic that can be determined relatively easily or inexpensively.
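The double-sampling ratio estimate described in section 14.3.3 can be illustrated with a short Python sketch. It assumes, purely for simplicity, simple random sampling at both phases and a known number of holdings N; the function name `double_sampling_ratio_total` and all figures are hypothetical. Following the notation above, x is the measured area and y is the interview response.

```python
def double_sampling_ratio_total(n_population, interview_large,
                                interview_sub, measured_sub):
    """Ratio estimate of the total measured area under double sampling.

    interview_large: interview-reported areas (y) for the large first-phase sample
    interview_sub:   interview-reported areas (y) for the measured subsample
    measured_sub:    objectively measured areas (x) for the same subsample
    Assumes simple random sampling at both phases (an assumption of this sketch).
    """
    if len(interview_sub) != len(measured_sub):
        raise ValueError("subsample lists must refer to the same holdings")
    # Estimated population total of the inexpensive Y-characteristic,
    # used in place of a total based on 100-percent coverage.
    y_total_hat = n_population * sum(interview_large) / len(interview_large)
    # Ratio of measured to reported area in the subsample.
    ratio = sum(measured_sub) / sum(interview_sub)
    return ratio * y_total_hat


# Hypothetical figures: 1,000 holdings, areas in hectares.
estimate = double_sampling_ratio_total(
    n_population=1000,
    interview_large=[10.0, 12.0, 8.0, 10.0],
    interview_sub=[10.0, 8.0],
    measured_sub=[9.0, 7.2],
)
print(round(estimate, 1))
```

In this made-up example the interviews overstate area by about 10 percent in the subsample (ratio 0.9), so the interview-based total of 10,000 ha is scaled down to roughly 9,000 ha. With a complete census of interview responses, `y_total_hat` would simply be replaced by the census total.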
Compared with the method based on area measurement alone, methods using ratio estimation will be preferred if the gain in efficiency more than offsets the cost of obtaining the supplementary observations by the less expensive technique (either from the entire population or, in the case of double sampling, from a larger sample from the population). The factors to be considered are:

(1) The strength of the relationship between the data obtained by the two methods. The interview response must have a high positive correlation with the area measurement if a significant improvement is to be obtained. One would reasonably expect this to be the case.

(2) The relative cost of the two methods. Assuming that the correlation is large enough, ratio estimation will reduce the number of holdings requiring area measurement in order to achieve a given level of reliability. Whether or not this reduction will offset the cost of obtaining the interview responses depends in part upon the difference in costs between the two types of observations.

Compared with the method based only on interview responses, the use of ratio estimation will be preferred whenever it is believed that the bias in the interview responses is sufficient to justify the additional expense of obtaining the area measurements. The concept of mean square error (MSE) is needed to understand the situation more fully. Recall from previous chapters that the variance is based on differences between estimates (x') based on samples and the value X that would be obtained if data had been collected from all members of the population, using the same techniques. The mean square error, on the other hand, is based on differences between estimates based on samples and the true value of the quantity being measured (XT).
If the data-collection technique is unbiased (X = XT), the MSE is equivalent to the variance; if the technique is biased, the MSE is equal to the variance plus the square of the bias (X - XT), or

(14.1) $\mathrm{MSE}(x') = \mathrm{Var}(x') + (X - X_T)^2$

For a given cost, data can be obtained by interview from a sample of a certain size. For the same cost, data can be obtained by interview from a smaller sample, combined with objective measurements from a subsample of this sample. Estimates based on the large interview sample will have a specified MSE containing a bias component as well as a variance component. Ratio estimates based on the combination of interview and objective measurement data will have a smaller bias but a larger variance. The MSE may be either larger or smaller than the MSE based only on the large interview sample depending on the variability in the population, the relative cost of the two procedures (which determines the relative sample sizes), the relative size of the biases (or the effectiveness of the ratio estimation procedure in reducing the bias), etc. The sampling statistician must consider all of these factors in allocating the available resources between the two procedures. His goal is to minimize the MSE for a given cost (or to minimize the cost of obtaining an acceptable level of reliability).

14.4 OBJECTIVE MEASUREMENT OF YIELD

The goal of objective measurement of yield is usually to estimate the yield of a crop on a unit basis (such as bushels per acre, quintals per hectare, etc.). In order to estimate the total production, it is necessary to have also an estimate of the total area planted in the crop in question. In some instances, only the yield is estimated by objective means, although estimates of both the yield and the area should be based on objective measurements.
The general procedure in making objective measurements of yield (usually called "crop cutting") is to use a random process to select areas (usually called plots) planted, and to cut and weigh the produce from each of these plots at or near the time the remainder of the field is harvested.3 Each different crop has different characteristics, and the same crop will behave differently in different parts of the world. Consequently, there is no specific set of rules that can be applied to all crops or even to the same crops in different locations. We will, however, discuss in general terms some of the factors to be considered in planning such a program and describe some of the techniques that have been used in the past. 14.4.1 Pilot studies Because information gained about other crops or about the behavior of the crop in question in other countries is not directly transferable to one's own situation, pilot studies should be carried out before establishing any program for objective measurement of yield. Pilot studies can provide important information about most of the things that need to be considered such as sampling variability, optimum size and shape of plot, harvesting procedures, problems such as personnel and materials needed to carry out the work, etc. They are also useful as training devices for those who will eventually be in charge of the full-scale operation. On the basis of the pilot studies, the investigator can develop a sampling plan and field procedures appropriate to the conditions under which the survey will be conducted. After a procedure has been decided upon, it is usually advisable to put it into operation only gradually and, after it is in full operation to carry it out for a few years simultaneously with the procedure it is to replace. 
The existing program, no matter how inadequate it may be, should not be ended until the proposed new method has been sufficiently tested and found to be clearly superior and operationally feasible.4 After its superiority and feasibility have been established, the new method can then serve as a basis for evaluating the bias in the old method which would not be possible unless the two were conducted simultaneously for a few years. This is particularly important to users of the data who are interested in examining differences or trends over a period of years; they must know to what extent observed differences in the data are simply the result of differences in measurement technique. 14.4.2 Variability One must have some idea of the variability in yield of the crop to be measured in order to plan wisely. Two aspects of variability which are of interest are: 3. Objective measurements are also used to forecast yields on the basis of observations made earlier in the season. Since the sampling procedures used in forecasting yields are quite similar to those used in estimating yields, only the latter are discussed in this section. 4. Actually, it may be necessary to continue the existing program in any case, particularly if data are required for administrative areas different from those for which estimates are made using objective data. Furthermore, the existing program may collect data on a number of crops which are not economically important enough to justify an expensive objective measurement program. 257 (1) The relative variability of yields for different sizes and shapes of plots. (2) For a plot of given size and shape, the relative magnitude of the variation among fields and the variation among plots within a field. In deciding which type of plot to use, the investigator must balance the variability against the cost. 
He will attempt to select the plot that will give the desired degree of reliability for the lowest cost, although other factors (for example, personnel considerations) may force him to choose one that is not quite the best in terms of costs and variances. Experience has shown that in almost all cases, the variation among fields is considerably greater than variation within fields. As a result, the number of plots selected within each sample field should be small so that the available resources can be more efficiently expended on sampling as many different fields as possible. In fact, in some investigations, the optimum number of plots has been only one per field.5 A minimum of two plots is necessary, of course, if one wishes to estimate the within-field variability from the sample; nevertheless, the investigator may choose to have only one plot per field if the within-field component of variance is very small compared with the between-field component.

14.4.3 Size and shape of plot

Circular, triangular, square, and rectangular plots have all been used in past studies for crops that are scattered in the field or planted in very closely spaced rows (for example, small grains or hay). For crops in widely spaced rows (for example, maize or cotton), rectangular plots are the logical choice; the width is often designated in terms of rows and the length in terms of feet (or meters, etc.). Along with the shape of the plot, a method of marking it must be specified. Rigid frames or other devices have been used successfully for marking small plots. Ropes, chains, etc., are easier to transport but are more difficult to place in the field if the worker has to measure and drive stakes at the corners, etc. For a triangular plot, a closed chain with rings at the three vertices can be used quite easily; the same device, provided it forms a right triangle, can also be used to mark rectangular plots using a suitable combination of triangles.
Large plots are usually laid out using pegs or stakes, string, and a measuring tape. As the size of the plot increases, the variability among plots decreases; however, since the within-field contribution to the overall variance is usually negligible relative to the other sources of variance, small plots are usually preferred from a practical standpoint. One man can usually do the work alone; he can place a portable frame much faster than he can stake out a large plot, he can harvest more quickly, and he has less material to handle. Unfortunately, experience has shown that small plots almost always produce seriously biased estimates. The reasons for this are not entirely clear, but it appears that two factors are largely responsible: (1) In locating the plot in the field, it is much easier for the field worker to allow the condition of the crop to influence the precise location of the smaller plot. (2) The problem of whether to count plants on the boundary as being in or out of the plot is more critical with the smaller plot, since the perimeter of a small plot is greater relative to its area than is the perimeter of a large plot. The general tendency appears to be to include plants that should be excluded and, thus, to consistently overestimate the yield. For a smaller plot, even a single plant erroneously included can seriously affect the results.

5. Theoretically, the optimum number of plots need not be an integer. As a practical matter, of course, the theoretical result must be rounded to an integer.

14.4.4 Locating the plot in the field

Many different procedures have been proposed for locating plots in the field. Whatever method is used, it is important that the field staff understand clearly how it should be done, and checks should be made to see that they are following the instructions. Otherwise, subjective bias on the part of the field worker will almost certainly enter into the procedure.
Ideally it would be desirable to divide the entire field into plots of the size and shape decided upon and select the required number of plots at random. However, this is not usually practicable. A method that has been used and is practicable whenever the field is rectangular (or can be conveniently enclosed in a rectangle) is to locate points at random within the field; the sample plots are then laid out in a prescribed manner about these points. For each plot to be located, the procedure is as follows: (1) The field worker selects a random number x between 0 and n1, where n1 represents the total length of one dimension of the field (or of the enclosing rectangle); he selects another random number y between 0 and n2, where n2 represents the total length of the other dimension. For a row crop, the first dimension would usually be expressed in terms of the number of rows.6 In other cases, the dimensions would be expressed in terms of units, such as meters, or in terms of steps or paces. (2) Starting at a predetermined corner, the field worker measures or paces (or counts rows) the distance x along the appropriate side of the field (or of the enclosing rectangle); then at right angles to this side, he measures or paces the distance y into the field. (3) If the worker is still within the boundaries of the field, he marks the random point (for example, by digging with his heel and driving a stake). If he is not within the boundaries of the field (he would, of course, be within the enclosing rectangle), he uses another pair of random numbers and repeats the process. (4) From this point, the field worker lays out the plot. If the plot is to be circular, the random point should be used as the center. If it is to be triangular or rectangular, the point should be used to locate a predetermined vertex or corner; this vertex or corner is usually chosen so that the plot will extend away from the random point in the direction that the worker has been walking. 
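Steps (1) through (4) amount to drawing points in the enclosing rectangle and rejecting those that fall outside the field. The following Python sketch illustrates the idea under stated assumptions: the function names, the 100 by 60 rectangle, and the shape of the field are all hypothetical, coordinates are drawn as continuous values (for a row crop an integer row number would be drawn instead), and a rectangular plot is assumed to extend away from the random point in the direction the worker has been walking.

```python
import random


def locate_random_point(n1, n2, inside_field, rng):
    """Return ((x, y), rejected) for one accepted random point.

    n1, n2       -- dimensions of the field or of the enclosing rectangle
    inside_field -- function (x, y) -> True if the point lies in the field
    rng          -- a random.Random instance supplying the random numbers
    """
    rejected = 0
    while True:
        x = rng.uniform(0, n1)   # step (1): distance along the first side
        y = rng.uniform(0, n2)   # step (1): distance into the field
        if inside_field(x, y):   # step (3): accept only points in the field
            return (x, y), rejected
        rejected += 1            # otherwise draw another pair of numbers


def plot_corners(point, width, length):
    """Step (4): corners of a rectangular plot whose predetermined
    corner is the random point and which extends away from it."""
    x, y = point
    return [(x, y), (x + width, y), (x + width, y + length), (x, y + length)]


def in_field(x, y):
    # Hypothetical irregular field: the enclosing rectangle with one
    # corner cut off by a straight boundary.
    return x / 100 + y / 60 <= 1.5


rng = random.Random(2006)
point, rejected = locate_random_point(100, 60, in_field, rng)
print(point, rejected, plot_corners(point, 4, 6))
```

Note that the sketch accepts any point inside the field, so a plot laid out from a point near the boundary may extend beyond the field.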
Figure 2 on the following page illustrates this procedure. In this example the point (x1, y1) falls inside the field and is accepted. The point (x2, y2) falls outside the field and is rejected. From the sample point, the plot would usually extend upward and to the right.

6. The random number would then be selected between 1 and the total number of rows in the field (n1).

One difficulty in this scheme is that it allows plots to overlap field boundaries; any of the several feasible rules that can be used in such cases present certain problems. Consider, for example, a field of maize 200 rows wide and 100 meters long. Suppose that the plot is to be 4 rows wide by 6 meters long. Suppose further that the selected row coordinate is 198 and the length coordinate is 95. From the point of intersection of the coordinates, the plot would extend 1 meter and 1 row beyond the boundaries of the field (the plot starts at the end of the 95th meter but includes row 198). Possible rules that could be adopted to take care of this situation include:

(1) Instruct the worker to harvest only the partial plot 3 rows by 5 meters and, of course, to record these dimensions on his form. Using the proper inflation factor, an unbiased estimate of the yield for this field could be made. In this example, this procedure could be carried out rather easily; however, if the field were irregular in shape or the plot were circular or triangular, the worker might find it difficult to estimate the portion of the plot in the field.

(2) Instruct the worker to think of the rows as being numbered in a circular manner and similarly the length. Thus, in this example, row 1 would be the fourth row of the plot and the first meter in each row would be taken to finish out the length of the plot. This, too, would be an unbiased procedure. It would, however, not be practicable for anything except rectangular plots in regularly shaped fields.
Furthermore, it might be difficult to explain it to the average field worker. Finally, it does not fit into the usual concept of a plot as a contiguous piece of land.

(3) Instruct the worker to restrict his random selection to numbers that will not allow this situation or, equivalently, to reject plots found to overlap boundaries and select another set of coordinates. In this case, in the example, he could do the former by restricting the selection for rows to numbers between 1 and 197 and for length to numbers between 0 and 94. This procedure is clearly biased since the edges of the field (in the example, the first and last four rows and the first and last six meters) have less chance of being in the sample than does the remainder of the field. If the yield tends to be greater or smaller than average around the edges of the field, estimates of yield based on this method will be biased. However, this is the simplest procedure. If the borders of the field are small in area relative to the remainder of the field or if there is no reason to believe that the yield is different along the edges, this method can be recommended in preference to unbiased but more difficult procedures.

14.4.5 Harvesting procedure

If the plots are small, the field worker will probably do the work himself, cutting the crop and weighing it in the field. He will then take a small subsample to be sent to the central office for drying. (It is always a good practice to return the remainder of the produce to the holder.) If plots are large enough, it may be desirable to harvest them by the same method that the holder will use in the regular harvest and, if possible, at the same time. This will require his cooperation and help.

14.4.6 Adjustment to actual production

The technician's method of harvesting small plots and processing the produce usually gives a higher rate of yield than do the normal harvesting procedures used by the holder because of greater harvesting losses in the normal methods.
For some crops, these losses are substantial. In addition, it is not possible to harvest all plots on or immediately before the harvest date. If the worker waits too long to start harvesting, he will almost certainly find some fields harvested before he arrives; consequently, he will need to start harvesting plots in some fields while the crop is immature. Both of these factors will cause biased estimates if adjustments are not made. (The harvesting of small plots measures what is often referred to as biological yield.) One method of adjustment is to select a subsample of fields of known area and harvest them for the holders, using the normal procedures. This provides a basis for adjusting the data collected from the harvested plots. A similar method appropriate for some crops (for example, hay crops that are taken from the field in the form of bales) is to arrange to weigh the entire crop in a subsample of fields as the holder transports it from the harvested field, but allowing the holder to harvest it whenever and however he wishes. Another method of adjustment is to carry out a gleaning operation after harvest to estimate field losses directly. The estimated field losses per unit area are then subtracted from the estimated biological yield to get the actual yield. This procedure has the advantage of not requiring the worker to be present at the harvest, an important consideration since several holders of different sample fields may all decide to harvest on the same day. Unfortunately, experience has shown that the problems of estimating field losses are fully as great as those of estimating the original biological production. As already mentioned, it is desirable that sample plots be harvested as near as possible to the date the remainder of the field is harvested; however, this cannot always be accomplished for all fields. One object of a pilot study would be to determine what adjustments, if any, must be made for differences between these harvesting dates.
For many crops, no adjustment is necessary because the crop has essentially completed its growth before either date and is then in the process only of losing moisture. An additional adjustment that must be made is for moisture content. A procedure commonly used is to dry the material from the plots (or a subsample of it) until it is at or very near to 0% moisture content and then to weigh it. This so-called dry weight can then be adjusted to any moisture content desired. For many crops, a standard moisture content has been specified. If the dry material is only a subsample of the plot, a two-step process is required. The material from the entire plot and the subsample must be weighed separately in the field immediately after cutting. The subsample is then dried and weighed. The dry weight of the entire plot can then be estimated using the ratio of dry to wet weight of the subsample.

14.4.7 Operational considerations

Before an extensive program to measure yields objectively can be put into operation, numerous practical problems must be solved. These include the availability of labor, the availability of facilities for drying the crops, equipment needs, the need to coordinate the activities of the workers with the holders' plans for harvesting their crops, etc. The problem of timing can be very difficult, particularly when the crop is likely to be ready for harvest at the same time over a wide area. As stated previously, one important reason for conducting pilot studies is to obtain information about these practical problems.

Study Assignment

Problem A. The sketch below simulates a segment outlined on an aerial photo. The segment contains a total of 100 hectares divided into categories according to the uses made of the land. The categories are:
Crop land: A1 - maize
A2 - wheat
A3 - other crop land
B - grassland
C - forest
D - wasteland
A grid of 36 dots has been placed over the segment to be used in estimating the amount of land by categories of use.

Exercise 1.
Estimate the number of hectares in this segment that are used for crop land.
Exercise 2. Estimate the number of hectares for grassland.
Exercise 3. Estimate the number of hectares in forest and wasteland.
Exercise 4. Estimate the proportion of crop land used for maize. In what basic way does this estimate differ from those in exercises 1 to 3?

Problem B. In the sketch above, marks on the east and west boundaries of the segment subdivide the boundaries into 40 units. Using these marks as guides, place two lines at random across the segment parallel to the north and south boundaries.
Exercise 5. Use these parallel lines to estimate the quantities estimated in Problem A.
Exercise 6. For each quantity, compile the distribution of the estimates obtained by several trials or several persons.

Problem C. The sketch below shows a field bordering on a river.
Exercise 7. Draw a circle around the corner corresponding to the unique point according to each of the definitions given below. Place the appropriate letter (a, b, c) by each circle.
(a) Northwest corner - Identify those boundary points lying farthest north. The northwest corner is the most western of these points.
(b) Northwest corner - Identify those boundary points lying farthest west. The northwest corner is the most northern of these points.
(c) Southwest corner - Identify those boundary points lying farthest south. The southwest corner is the most western of these points.

Problem D. Data on the total area of crop land harvested have been obtained by interview from a simple random sample (selected without replacement) of 24 holdings out of a population of 96 holdings. Objective measurements have been carried out on a subsample of 8 of these holdings selected at random without replacement. The data are shown in the table below.
Hectares of crop land harvested

Unit    Interview (Y)    Objective measurement (X)
  1          14                 14.4
  2          79                  -
  3          46                  -
  4         112                116.1
  5          46                  -
  6          92                  -
  7          29                  -
  8          40                 41.9
  9          12                  -
 10          78                 80.4
 11          66                  -
 12          43                  -
 13          39                  -
 14          91                 93.9
 15          17                 16.8
 16          68                  -
 17         100                  -
 18          87                  -
 19          74                 75.4
 20          64                  -
 21          78                  -
 22          40                 42.6
 23          22                  -
 24          55                  -

Exercise 8. Estimate the total crop land harvested using the interview data only. Estimate the variance of this estimated total.
Exercise 9. Estimate the total crop land harvested using the objective measurement data only. Estimate the variance of this estimate.
Exercise 10. Using the formulas given below, estimate the total crop land harvested and the variance of this estimate using both types of data and ratio estimation.
where
n1 = size of large interview sample
n2 = size of objective measurement subsample

SELECTED LIST OF REFERENCES

1. Cochran, William G. Sampling Techniques. Second edition. New York, John Wiley and Sons, 1963.
2. Food and Agriculture Organization of the United Nations (FAO). Estimation of Areas in Agricultural Statistics. Edited by S. S. Zarkovich. Rome, 1965.
3. Food and Agriculture Organization of the United Nations (FAO). Estimation of Crop Yields. By V. G. Panse. Rome, 1954.
4. Food and Agriculture Organization of the United Nations (FAO). Sampling Methods and Censuses. By S. S. Zarkovich. Rome, 1965. Quality of Statistical Data. Rome, 1966.
5. Hansen, Morris H.; Hurwitz, William N.; and Madow, William G. Sample Survey Methods and Theory. New York, John Wiley and Sons, 1953. (Volume I: Methods and Applications; Volume II: Theory)
6. Kish, Leslie. Survey Sampling. New York, John Wiley and Sons, 1965.
7. Kniceley, Maurice R. Probability Sampling for Surveys and Censuses, Course Notes, PSDP, 1985.
8. Megill, David J. Preliminary Recommendations for Designing the Master Frame for the Senegal Intercensal Household Survey Program, U.S. Bureau of the Census, November 1990.
9. Neter, John and Wasserman, William.
Fundamental Statistics for Business and Economics. Boston, Mass., U.S.A., Allyn and Bacon, 1961. 10. Sampford, M. R. An Introduction to Sampling Theory. Edinburgh and London, Oliver and Boyd, 1962. 11. Sukhatme, Pandurang V. Sampling Theory of Surveys with Applications. Ames, Iowa. U.S.A., The Iowa State College Press, 1953. New Delhi, India, The Indian Society of Agricultural Statistics, 1953. 12. The RAND Corporation. A Million Random Digits. Glencoe, Illinois, U.S.A., The Free Press, 1955. 13. United Nations.Statistical Office. Handbook of Household Surveys: A Practical Guide for Inquiries on Levels of Living. New York, 1964. (Studies in Methods, Series F, No. 10) 14. U.S. Bureau of the Census. The Current Population Survey Reinterview Program, Some Notes and Discussion. Washington, D.C., U.S. Government Printing Office, 1963. (Technical Paper No. 6) 267 15. U.S. Bureau of the Census. The Current Population Survey--A Report on Methodology. Washington, D.C., U.S. Government Printing Office, 1963. (Technical Paper No. 7) 16. U.S. Department of Commerce. Statistical Abstract, Washington, D.C., U.S. Government Printing Office, 1981, Table 202, P. 123. 17 Yates, Frank. Sampling Methods for Censuses and Surveys. Third Edition. New York, Hafner Publishing Company, 1960. 268 Annex A GLOSSARY OF TERMS Accuracy: Quality of survey result as measured by the closeness of the survey estimate to the exact or true value being estimated. The accuracy is affected by both sampling error and bias. Allocation of sample: The method used in determining how the sample should be distributed. In stratified, cluster sampling, it usually refers to the number of clusters to be allocated to each stratum and the size of sample selected from each cluster. Area sample: A type of sample (usually a multistage sample) in which the sampling units are individual land areas (segments) which can be defined on a map. 
The segments cover the entire area to be included in the survey; the segments do not overlap; and, in most applications, the boundaries of each segment must be clearly defined so they can be recognized and identified by enumerators in the field. Often the segments are clusters of the units of analysis; for example, clusters of farms or housing units. Each unit of analysis must be associated with one and only one segment. Attribute: See also ‘Characteristic.’ Quality or characteristic. This term is also used in reference to the proportion of units having a certain characteristic. Benchmark statistics: Statistics that provide information against which one can measure or compare changes. Bias: The difference between the expected value of an estimator and the true population value being estimated. When the bias is equal to zero, the estimator is said to be “unbiased.” The term bias is also generally used to designate an effect which deprives a statistical result of representativeness by systematically distorting it, as distinct from a random error which may distort on any one occasion but balances out on the average. Bounded recall: An interview where the respondent is reminded of what he reported in an earlier interview and is then asked only to report on any new events that occurred subsequent to the bounding interview. This method is usually used in “income and expenditures surveys, “ in which at the beginning of the bounded interview (the second and subsequent interviews), the respondent is told about the expenditures reported during the previous interview, and is then asked about additional expenditures made since then. Bounding: Prevention of erroneous shifts of the timing of events by having the enumerator or respondent supply at the start of the interview (or in a mail survey) a record of events reported in the previous interview. 
Census: Data collection program through which attempts are made to collect information about every element (person, household, farm, etc.) in the population. Characteristic: A variable having different possible values for different individual units of sampling or analysis. In a sample survey, we observe or measure the values of one or more characteristics for the units in the sample. For example, we observe (or ask about) the area of land in rice, or the number of cattle on a farm. Classification errors: Errors caused by conceptual problems and misinterpretations in the application of classification systems to survey data. Cluster sample: A system of sampling in which the units of analysis of the population are considered as grouped into clusters, and a sample of clusters is selected. The selected clusters then determine the units to be included in the sample. The sample may include all units in the selected clusters or a subsample of units in each selected cluster. Clusters: See also ‘Cluster sample.’ Small groups into which a population is divided to facilitate the data collection. The groups generally are defined so as to help break a large survey area into workload-sized chunks and/or to reduce travel and administrative costs. Ideally, the units in a cluster should be as heterogeneous as possible. Coding: A technical procedure for converting verbal information into numbers or other symbols which can be more easily counted and tabulated. Coding error: Error that occurs during the coding of sample data; the assignment of an incorrect code to a survey response. Coefficient of variation: The relative standard error; that is, the standard error as a proportion of the magnitude of the estimate. The population coefficient of variation is denoted by CV, which is estimated from a sample by cv. The coefficient of variation of an estimate, such as the mean, proportion, or total, is denoted by CV( ), with the estimate of interest placed inside the parentheses.
If $\hat{\theta}$ is an estimate of the population parameter $\theta$, then $CV(\hat{\theta})$ denotes the true coefficient of variation of the estimate and $cv(\hat{\theta})$ is an estimate of $CV(\hat{\theta})$. Conditioning effect: The effect on responses resulting from the previous collection of data from the same respondents in recurring surveys. Confidence interval: A range above and below the estimated value which may be expected to enclose the true value with a known probability, assuming no bias. Consistent estimate: An estimate of a type that (while possibly biased) approaches more and more closely the true value being estimated as the size of sample increases, the most common example being a ratio estimate. Content error: Error of observation or objective measurement, of recording, of imputation, or of other processing which results in associating a wrong value of the characteristic with a specified unit. Coverage error: The error in an estimate that results from (1) failure to include in the frame all units belonging to the defined population, or failure to include specified units in the conduct of the survey (undercoverage), and (2) inclusion of some units erroneously, either because of a defective frame or because of inclusion of unspecified units or inclusion of specified units more than once in the actual survey (overcoverage). Cost function: A mathematical expression showing the cost of conducting a survey in terms of the sample sizes and unit costs. Editing: Preliminary step in which the responses are inspected, corrected and sometimes precoded according to a fixed set of rules. Efficiency: A comparative measure of one sample design relative to another with respect to amount of precision produced per unit of cost for a given sample size. Element: See Unit of analysis. Elementary Unit: See Unit of analysis. Estimate: A numerical quantity calculated from sample data and intended to provide information about an unknown population value. Estimating formula: A mathematical formula used to calculate an estimate.
Estimator: See Estimating formula. Expected value. The average value of the sample estimates over all possible samples. Finite Population Correction Factor (fpc): It’s a factor that corrects the value of the variance when the sample size is large with respect to the size of the population. Frame: A list of units which make up a population. The frame consists of previously available descriptions of the objects or material related to the physical field in the form of maps, lists, directories, etc., from which sampling units may be constructed and a set of sampling units selected; and also information on communications, transport, etc., which may be of value in improving the design for the choice of sampling units, and in the formation of strata. Imputation: The process of developing estimates for missing or inconsistent data in a survey. Data obtained from other units in the survey are usually used in developing the estimate. Independent information: Data known in advance or simultaneously with the survey, which are not based on the survey but may be used to improve the survey design. Such data may be used for stratifying, deciding on the probabilities of selection, or estimating the final results from the sample data. Interviewer bias: Bias in the responses which is the direct result of the action of the interviewer. Interviewer error: Errors in the responses obtained in a survey that are due to actions of the interviewer. Interviewer variance: The component of the nonsampling variance which is due to the different ways in which different interviewers elicit or record responses. Intracluster or intraclass correlation: A measure used to estimate the degree of homogeneity (or heterogeneity) between elementary units within a cluster. It can be used to determine how satisfactorily clusters have been formed. For example, the closer the value is to zero (or negative) the more unlike the elementary units are and, consequently, the better we’ve done to form clusters. 
It can also be used to evaluate how effectively the strata have been created.
Item nonresponse: The type of nonresponse in which some questions, but not all, are answered for a particular unit; that is, a question is missed for an interviewed unit.
List: A population in which the sampling units have been numbered or otherwise identified; the list of units can be the basis for the selection of a sample. See also Sampling frame.
Mean square error: A measure of the accuracy of an estimate or the extent to which an estimate from sample data differs from the true population value being estimated. If the estimates are unbiased, the mean square error is equivalent to the variance.
Multiframe sampling: The use of two or more sampling frames to select a survey sample. Generally necessary when the usual frame, such as an address register, will not adequately cover the population and/or there are unique or unusually large units that must appear in the sample.
Multistage sampling: The most common type of cluster sampling. In this method, a sample of clusters is selected, and then a subsample of units is selected within each sample cluster. If the subsample of units is the last stage of sample selection, it is called a two-stage sample design (although each such unit may contain more than one unit of analysis, as in an area sample). If the subsample is also a cluster from which units are again selected, it is a three-stage design, or four-stage design, etc.
Noninterview: The type of nonresponse in which no information is available from occupied sample units for such reasons as not at home, refusals, incapacity and lost questionnaires.
Noninterview adjustment: A method of adjusting the weights for interviewed units in a survey to the extent needed to account for occupied sample units for which no information was obtained.
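One common way to carry out the noninterview adjustment described above is to inflate the weights of interviewed units by the ratio of the total weight of all occupied sample units to the total weight of the interviewed ones. The sketch below assumes a single adjustment cell and uses hypothetical names and weights; it is an illustration, not the procedure of any particular survey.

```python
def noninterview_adjusted_weights(base_weights, interviewed):
    """Scale the weights of interviewed units so they also represent noninterviews.

    base_weights: base weights (inverse probabilities of selection) of the
                  occupied sample units in one adjustment cell
    interviewed:  booleans, True where the unit was interviewed
    """
    total = sum(base_weights)
    responding = sum(w for w, r in zip(base_weights, interviewed) if r)
    if responding == 0:
        raise ValueError("no interviewed units in the adjustment cell")
    factor = total / responding       # noninterview adjustment factor
    # Noninterviews get weight zero; interviewed units carry the full cell total.
    return [w * factor if r else 0.0 for w, r in zip(base_weights, interviewed)]


weights = [100.0, 100.0, 100.0, 100.0, 100.0]   # made-up base weights
status = [True, True, False, True, True]        # third unit is a noninterview
adjusted = noninterview_adjusted_weights(weights, status)
print(adjusted)  # [125.0, 125.0, 0.0, 125.0, 125.0]
```

The adjustment keeps the weighted total of the cell unchanged, which removes noninterview bias only to the extent that interviewed and noninterviewed units in the cell are alike.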
Nonsampling error: The error in an estimate arising at any stage in a survey from such sources as varying interpretation of questions by enumerators, unwillingness or inability of respondents to give correct answers, nonresponse, improper coverage, and other sources exclusive of sampling error. This definition includes all components of the mean square error (MSE) except sampling variance.
Optimum allocation of sample: The selection of a sample in such a way as to produce the minimum standard error for a constant sample size or for a constant cost. It is used in both stratified sampling and cluster sampling.
Overhead costs: Costs that are fixed and do not vary with the size of the sample. These do not enter into designing the sample. Included are such costs as administration, rent, equipment, printing, and utilities.
Parameters: Values descriptive of the population distribution and calculated from all population units. They are estimated from a sample, the estimates being called statistics. For normal distributions the parameters are the mean and standard deviation.
Population: Any clearly defined set of units (or elements) for which estimates are to be made. The elements can be persons, farms, households, blocks, counties, businesses, and so on. Most of our discussion deals with sampling from a finite population, containing a finite number of elements.
Precision: Difference between the sample estimate and a complete count value collected under the same conditions. This is measured by the sampling error or relative sampling error.
Primary sampling unit (PSU): The units making up the sampling frame for the first stage of a multistage sample.
Probability of selection: The chance each unit has of being selected in the sample. This is known prior to sample selection.
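For stratified sampling, the optimum allocation entry above is often implemented by making each stratum's share of the sample proportional to $N_h S_h / \sqrt{c_h}$, where $N_h$ is the stratum size, $S_h$ its standard deviation and $c_h$ its cost per unit; with equal unit costs this reduces to Neyman allocation. The sketch below assumes these quantities are known in advance and uses hypothetical names and figures.

```python
import math


def optimum_allocation(n, sizes, std_devs, unit_costs=None):
    """Split a total sample of n among strata in proportion to N_h * S_h / sqrt(c_h).

    With equal unit costs (the default) this is Neyman allocation.
    Returns unrounded stratum sample sizes; rounding is left to the user.
    """
    if unit_costs is None:
        unit_costs = [1.0] * len(sizes)
    shares = [N * S / math.sqrt(c) for N, S, c in zip(sizes, std_devs, unit_costs)]
    total = sum(shares)
    return [n * share / total for share in shares]


# Two made-up strata: a small, variable one and a large, homogeneous one.
allocation = optimum_allocation(100, sizes=[1000, 3000], std_devs=[30.0, 10.0])
print(allocation)  # [50.0, 50.0]
```

In this example the smaller stratum receives half of the sample because its standard deviation is three times as large, whereas proportionate allocation would give it only a quarter.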
Probability proportionate to size (PPS): A method of sample selection in which units are selected with unequal probability of selection, the probability for each unit being proportionate to a measure of size. The measure of size for a unit is a number assigned to that unit in advance of selection, which is believed to be highly correlated with the statistics to be estimated.
Proportion: Measure of the relative frequency of units that possess a certain characteristic in the population or sample.
Proportionate stratified sampling: A system of selecting a stratified sample in which the same probability of selection is used in each stratum.
Reliability: The confidence that can be assigned to a conclusion of a probabilistic nature.
Response bias: The difference between the average of the averages of the responses over a large number of independent repetitions of the census and the unknown average that could be measured if the census were accomplished under ideal conditions and without error. Equivalently, the difference between the average reported value over trials and the true value. It is a combined bias: the algebraic sum of all bias terms representing the diverse sources of bias.
Response error: The part of the nonsampling error which is due to the failure of the respondent to report the correct value (respondent error) or of the interviewer to record the value correctly (interviewer error). It includes both the consistent response biases and the variable errors of response which tend to balance out.
Response variance: That part of the response error which tends to balance out over repeated trials or over a large number of interviewers; the variance among the trial means over a large number of trials. The response variance of a survey estimator is the sum of the simple response variance and the correlated response variance.
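The PPS entry above does not prescribe a particular selection procedure. One widely used procedure is systematic PPS selection on the cumulated measures of size, sketched below under the assumption that every measure of size is positive and that none exceeds the sampling interval (otherwise a unit can be selected more than once). The function name and the sizes are hypothetical.

```python
import random
from itertools import accumulate


def pps_systematic(sizes, n, rng=None):
    """Select n units with probability proportionate to size.

    Returns the indices of the selected units. Unit i is selected with
    probability n * sizes[i] / sum(sizes), provided that value is at most 1.
    """
    rng = rng or random.Random()
    interval = sum(sizes) / n                 # sampling interval on the size scale
    start = rng.random() * interval           # random start in [0, interval)
    cumulative = list(accumulate(sizes))      # cumulated measures of size
    selected = []
    unit = 0
    for k in range(n):
        point = start + k * interval          # k-th selection point
        # Advance to the unit whose cumulated size range contains the point.
        while unit < len(cumulative) - 1 and cumulative[unit] <= point:
            unit += 1
        selected.append(unit)
    return selected


# Made-up measures of size, e.g. household counts from a previous census.
sample = pps_systematic([10, 20, 30, 40], n=2, rng=random.Random(1))
print(sample)
```

Passing a seeded `random.Random` makes the selection reproducible, which is useful for documenting how a sample was drawn.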
Response variance, correlated: The contribution to the total variance arising from nonzero correlations (in the sense of the distribution of measurement errors) among the responses of sample units; that is, the contribution to the total response variance from the correlations among response deviations.
Response variance, uncorrelated: The simple response variance contribution to the total variance arises from the variability of each survey response about its own expected value. In terms of a simple random sampling design, the simple response variance is the population mean of the variances of each population unit, that is, the variance of the individual response deviations over all possible trials. It is the basic trial-to-trial variability in response, averaged over the elements in the population.
Rotation bias: A type of bias that occurs in panel surveys, which consist of repeated interviews on the same units. Although these surveys are designed so that the estimates of a characteristic are expected to be nearly the same for each panel in the survey, this expectation has not been realized. For example, an estimate from a panel that is in the survey for the first time may differ significantly from estimates from the panels that have been in the survey longer. The term also covers the downward tendency in the value of the characteristics reported if the observation of the same units is continued over a longer period of time. For example, it was found in expenditure surveys that the average expenditure per item per person is usually higher in the first week of the survey than in the second or the third.
Sample: A subset of a population. As used in these chapters, it always refers to a probability sample, that is, a sample in which each element in the population has a known probability of selection.
Sample design: The sampling plan and estimation procedures.
Sample survey: A data collection program through which information is collected from a probability-selected subset of the population.
Sampling bias: That part of the difference between the expected value of the sample estimator and the true value of the characteristic which results from the sampling procedure, the estimating procedure, or their combination.
Sampling distribution: The distribution of values of a statistic calculated from all possible samples of the same size from the same population.
Sampling error (of estimator): That part of the error of an estimator which is due to the fact that the estimator is obtained from a sample rather than a 100 percent enumeration using the same procedures. The sampling error has an expected frequency distribution for repeated samples, and the sampling error is described by stating a multiple of the standard deviation of this distribution. Equivalently, that part of the difference between a population value and an estimator thereof, derived from a random sample, which is due to the fact that only a sample of values is observed; as distinct from errors due to imperfect selection, bias in response or estimation, errors of observation and recording, etc. The totality of sampling errors in all possible samples of the same size generates the sampling distribution of the statistic which is being used to estimate the parent value.
Sampling frame: The totality of sampling units from which a sample is to be selected. The frame may be, for example, a listing of persons or housing units or a file of records.
Sampling variance: It is denoted by $Var(\hat{\theta})$, where $\hat{\theta}$ denotes any estimator. The term sampling variance refers to the variance of an estimator. For a simple random sample of $n$ units selected without replacement from a population of $N$ units, the variance of the mean is given by: $Var(\bar{y}) = \left(1 - \frac{n}{N}\right) \frac{S^2}{n}$, where $S^2$ is the population variance; the factor $1 - n/N$ is the finite population correction and is omitted when sampling is with replacement.
Sampling plan: The actual procedure describing how sample units are to be selected and from which sampling frames.
Sampling unit: The units to be selected.
These may or may not be the same as the units of analysis. For example, to obtain information on persons, one might use a complete listing in a census, or a register, and select a sample of persons directly. However, one could also select a sample of households and include in the survey all persons in the selected households. Similarly, one could select complete buildings, and include all persons in the sample buildings. The choice of the most efficient sampling unit is an important consideration in the design of a survey.
Sampling with replacement: A sample obtained by first selecting one element of the population, replacing it, then making a second selection and replacing it before making the third selection, etc., until n selections have been made. With this method of selection, a particular unit can be included more than once in the sample, in fact up to n times.
Sampling without replacement: A sample obtained by selecting one element of the population and, without replacing it, selecting one of the remaining elements; then continuing this process until n different selections have been made. With this method, a unit can be included only once in any sample.
Self-weighting sample: A sample in which every element in the population has the same chance of selection, although unequal probabilities may have been used at various stages of sampling. For example, clusters may have been selected with PPS; then the sampling within a selected cluster is done in such a way as to give each element in it the same chance of being in the sample as the elements to be selected in other clusters.
Simple random sample (also called unrestricted random sample): The simplest type of sampling system. For a sample of size n, each of the possible combinations of n elementary units that may be formed from a population of N units has the same chance of selection as every other combination of n units.
Moreover, every element will have the same chance of selection as every other element (chapters 2, 3, 4, and 5).
Standard deviation: The standard error of a simple random sample of size 1.
Standard error: A measure of the extent to which estimates from various samples differ from their expected value. With a reasonably large sample, the distribution of sample results for all possible samples is approximately the normal distribution, and probability statements can be made about how close the sample can be expected to come to the expected value, the probabilities being expressed in terms of the standard error. The standard error usually is expressed by the Greek letter $\sigma$ or by $S$. See also Variance.
Statistic: A quantity computed from sample observations of a characteristic, usually for the purpose of making an inference about the population. The characteristic may be any variable associated with a member of the population, such as age, income, employment status, etc. The quantity may be a total, an average, a median, or other percentile; it may also be a rate of change, a percentage, a standard deviation or any other quantity whose value we wish to estimate for the population.
Statistical inference: A decision, estimate, prediction, or generalization about the population based on information contained in a sample.
Stratification: The process of dividing a population into groups for the purpose of selecting a separate sample from each group. Each group is usually made as internally homogeneous as possible. The groups are called strata, with each one referred to as a stratum.
Stratified sampling: The method of sampling from a universe which has been stratified. At least one sample unit must be selected from each stratum, but at least two units are needed to calculate variances. Probabilities of selection can be different from stratum to stratum.
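The definitions of expected value, sampling distribution, sampling variance and standard error refer to "all possible samples". For a very small population these can be listed exhaustively, which makes the definitions concrete. The sketch below uses a made-up population of five values and simple random samples of size two without replacement, and compares the variance of the sample means with the usual formula $(1 - n/N) S^2 / n$; all names are hypothetical.

```python
import math
from itertools import combinations
from statistics import fmean, pvariance, variance

population = [2, 4, 6, 8, 10]     # made-up values for N = 5 units
n = 2                             # sample size
N = len(population)

# Sampling distribution of the mean: one value per possible sample.
sample_means = [fmean(s) for s in combinations(population, n)]

expected_value = fmean(sample_means)           # average over all possible samples
sampling_variance = pvariance(sample_means)    # variance of the sampling distribution
standard_error = math.sqrt(sampling_variance)

# Textbook formula for simple random sampling without replacement.
S2 = variance(population)                      # population variance, N - 1 divisor
formula_variance = (1 - n / N) * S2 / n

print(f"number of possible samples: {len(sample_means)}")
print(f"expected value: {expected_value:.2f}, population mean: {fmean(population):.2f}")
print(f"sampling variance: {sampling_variance:.2f}, formula: {formula_variance:.2f}")
print(f"standard error: {standard_error:.3f}")
```

The expected value equals the population mean, so the sample mean is an unbiased estimate here, and the enumerated sampling variance agrees with the formula including the finite population correction.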
Systematic error: As opposed to a random error, an error which is in some sense biased, that is to say, has a distribution with mean (or some equally acceptable measure of location) not at zero.
Systematic sampling: A method of sample selection in which the population is listed in some order and every kth element is selected for the sample.
Telescoping: The tendency of the respondent to allocate an event to a period other than the reference period (also called border bias). A telescoping error occurs when the respondent misremembers the duration of an event. While one might imagine that errors would be randomly distributed around the true duration, the errors are primarily in the direction of remembering an event as having occurred more recently than it did. This is due to the respondent's wish to perform the task required of him. When in doubt, the respondent prefers to give too much information rather than too little.
Total error: The difference between an estimate and its true value in the population, measured as the root mean square error, that is, the square root of the sum of variable error squared and bias squared.
True value: The value that would be obtained if no mistakes were made and no error existed.
Ultimate cluster: The totality of units included in the sample from a primary unit. Even if the sample is obtained using different stages of selection, the primary sampling unit becomes the ultimate cluster.
Unbiased estimate: A type of estimate having the property that the average of such estimates made from all possible samples of a given size is equal to the true value.
Unbounded recall: Ordinary type of recall, where respondents are asked for expenditures made since a given date and no control is exercised over the possibility that respondents may erroneously shift some of their expenditure reports into or out of the recall period.
Unit of analysis: A unit for which we wish to obtain statistical data.
The units may be persons, households, farms or business firms; they may also be products resulting from some machine process, etc.
Universe: See Population.
Variance: The square of the standard error; it is usually written as $S^2$ with a subscript to indicate the statistic to which it refers. The term is usually written without a subscript for the square of the standard deviation. Where there is any possibility of confusion, sampling variance is used for the square of the standard error, and population variance for the square of the standard deviation.
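The systematic sampling entry in this glossary ("every kth element") can be sketched in a few lines. The version below assumes a random start among the first k positions of the ordered list, which is a common convention that the definition itself does not state; the function name and the list are hypothetical.

```python
import random


def systematic_sample(units, k, rng=None):
    """Select every k-th unit from an ordered list, beginning at a random start."""
    rng = rng or random.Random()
    start = rng.randrange(k)      # random start among positions 0 .. k-1
    return units[start::k]


listing = list(range(1, 21))      # made-up frame of 20 numbered units
chosen = systematic_sample(listing, k=5, rng=random.Random(7))
print(chosen)
```

With 20 units and k = 5 every possible start yields exactly four units; when the list length is not a multiple of k, the sample size varies by one depending on the start.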