Sample Surveys
In this class we will consider the problem of sampling from a finite population. This is usually referred to as a survey. The goal of the survey is to learn about some parameters of a population, like averages or proportions. A well-designed survey avoids systematic biases. The two most typical sources of bias are selection bias and non-response bias.

Collecting data: Sample Surveys

A population is a class of individuals that an investigator is interested in. Examples of populations are:
1. All eligible voters in a presidential election.
2. All Facebook users who live in Santa Cruz.
3. The female elephant seals that mate at Año Nuevo State Reserve during the winter.
4. All Ford Focus drivers.

A full examination of a population requires a census. This may be impractical in many cases. If only one part of the population is examined, then we are looking at a sample. The goal is to make inferences from the sample to the whole population.

There are usually some numerical characteristics of the population that we are interested in. These are called parameters. For example:
1. The average age of eligible voters.
2. The average income of Facebook users in Santa Cruz.
3. The proportion of pups per female elephant seal.
4. The average number of miles that Ford Focus drivers drive in a week.

Parameters are unknown quantities. They are estimated using statistics, which are numbers that can be computed from the sample. The validity of those estimates depends on how well the sample represents the population.

A biased poll

Before the 1936 presidential election the Literary Digest, a very prestigious magazine, predicted that Roosevelt would lose the election to Landon, obtaining only 43% of the votes. The result of the election was that Roosevelt won by a landslide, 62% to 38%. Why was the Literary Digest so wrong?
Because their poll was badly designed

The Digest based its prediction on a sample of 2.4 million people who responded to a questionnaire that was mailed to 10 million people. The names and addresses of these people were obtained from telephone books and club membership lists. The sample had a strong bias against the poor, since in the 1930s they were unlikely to belong to clubs or have phones. The outcome of the election showed a split that followed a clear economic line: the poor voted for Roosevelt and the rich were with Landon.

- Taking a large sample does not improve the results when the sampling procedure is biased.

Another source of bias in the Digest’s poll is the large number of non-respondents. This produces a non-response bias, since non-respondents can be very different from respondents. Studies have shown that people from the middle class are more likely to respond than people from the upper or the lower classes. So in a survey with a high non-response rate, middle class people may be overrepresented.

When considering the quality of a survey keep in mind two possible sources of bias:
- Selection bias
- Non-response bias

Quota Sampling

Consider the following scheme to obtain a sample. You send an interviewer to the field and ask him or her to get a fixed number of interviews within certain categories. For example:
- Interview 13 subjects.
- Exactly 6 from the suburbs, 7 from the central city.
- Exactly 7 men and 6 women.
- Of the men, 3 have to be under forty, 4 above forty.
- Of the men, 1 has to be black and 6 white.

The list of restrictions could go on. The goal is to obtain a sample that matches the population in its demographic and social characteristics, so that it is representative. This is called a quota sampling scheme. But, in the end, the interviewer is free to decide who gets interviewed within each category; that is, the ultimate selection is left to human judgment.
Gallup polls were conducted using the quota system for more than a decade. These are the results regarding the Republican vote:

Year         1936   1940   1944   1948
Prediction    44%    48%    48%    50%
Result        38%    45%    46%    45%
Error          6%     3%     2%     5%

The sample sizes were around 50,000. In the 1948 election, Gallup predicted the wrong winner. Gallup had a systematic bias in favor of the Republican candidate in all elections from ’36 to ’48. The reason for the bias is twofold:
1. The sample mimics the population in the variables that are controlled for, but there are still other factors that influence the voting behavior of subjects in the sample.
2. There could be an unintentional bias of the interviewers. For example, Republicans might have been easier to interview and so more likely to be picked by an interviewer.

Using chance

To eliminate the selection bias in a sample we use chance in choosing the individuals to be included in the sample. How does it work?

We first set the size of the sample we need. From a list of the subjects in the population (the sampling frame) we take one by chance. We delete that subject from the list and take a second subject by chance from the remaining ones. The process continues until we have completed the sample. This is called simple random sampling. The subjects have been drawn at random without replacement.

A real poll

Using a sample based on chance eliminates selection bias, but a simple random sample can be difficult and costly when the population is large. Also, a simple random sample disregards valuable information about the characteristics of the population. A better idea is to consider a sampling scheme that consists of multiple stages, each of which is subject to chance. The Gallup poll after the 1948 election is an example. The poll is taken as follows:
1. The nation is split into 4 regions: West, Midwest, Northeast and South. Within each region, population centers of similar size are grouped together.
2. A random sample of the towns is selected.
   No interviews are conducted in the towns not in the sample.
3. Each town is divided into wards and the wards are subdivided into precincts.
4. Some wards are selected at random within the selected towns.
5. Some precincts are selected at random within the selected wards.
6. Some households are selected at random within the selected precincts.
7. Some members of the selected households are interviewed.

This is called a multistage cluster sampling scheme.

The results

The following table presents the results of Gallup’s predictions for some elections from 1952 to 1992.

Year          1952        1960     1968    1976    1984    1992
Sample size   5,385       8,015    4,414   3,439   4,089   2,019
Winner        Eisenhower  Kennedy  Nixon   Carter  Reagan  Clinton
Prediction    51%         51%      43%     49.5%   59%     49.0%
Result        55.4%       50.1%    43.5%   51.1%   59.2%   43.2%
Error         4.4%        0.9%     0.5%    1.6%    0.2%    5.8%

We observe a much smaller error (except for the 1992 election), no bias in favor of the Republican candidate, and much smaller sample sizes.

Problems

Investigators doing polls have to face several problems that can bias the results of the survey even when a probabilistic sample is used.

Non-voters: Usually between 30% and 50% of the eligible voters don’t vote. But many of these are tempted to respond affirmatively when asked whether they intend to vote. Interviewers ask indirect questions that make it possible to check whether the person is genuinely a voter or not.

Undecided: Polls ask questions that give information about the political attitudes of the interviewed person in order to forecast the vote of undecided voters.

Response bias: Questions can be posed in a way that biases the response. A useful tool is to have the interviewed person deposit a ballot in a box.

Non-response bias: As discussed before, this can create a bias since non-respondents are different from respondents. This is usually corrected by giving more weight to people who were difficult to reach, since they represent, to some extent, a subpopulation that is closer to the non-respondents.
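
The draw-and-delete procedure from the "Using chance" section can be sketched in a few lines of Python. This is only an illustration: the function names, the toy population and all of its counts are assumptions made up for the example. The counts are chosen so that 62% of the population supports a candidate A, while only 40% of the people on a hypothetical phone list do. The sketch compares a large sample taken from the biased list with a much smaller simple random sample.

```python
import random


def simple_random_sample(frame, n, rng):
    """Draw n subjects at random, without replacement, from the sampling frame."""
    remaining = list(frame)  # copy, so the caller's list is left untouched
    sample = []
    for _ in range(n):
        i = rng.randrange(len(remaining))  # take one remaining subject by chance
        sample.append(remaining.pop(i))    # delete it from the list
    return sample


def share_for_a(subjects):
    """Fraction of the given subjects who vote for candidate A."""
    return sum(s["votes_a"] for s in subjects) / len(subjects)


# Hypothetical population of 100,000 voters (all counts are made up):
# 30,000 are on a phone list, 12,000 of them vote for A;
# 70,000 are not on the list, 50,000 of them vote for A.
population = (
    [{"listed": True, "votes_a": True}] * 12_000
    + [{"listed": True, "votes_a": False}] * 18_000
    + [{"listed": False, "votes_a": True}] * 50_000
    + [{"listed": False, "votes_a": False}] * 20_000
)

rng = random.Random(0)  # fixed seed so that the run is repeatable

truth = share_for_a(population)
# Biased procedure: everyone on the phone list, a very large sample.
biased_estimate = share_for_a([s for s in population if s["listed"]])
# Chance procedure: a simple random sample of only 1,000 subjects.
srs = simple_random_sample(population, 1_000, rng)
srs_estimate = share_for_a(srs)

print(f"population share for A:       {truth:.3f}")
print(f"phone list (30,000 subjects): {biased_estimate:.3f}")
print(f"simple random sample (1,000): {srs_estimate:.3f}")
```

The phone-list estimate is off by 22 percentage points no matter how many listed people are included, while the small random sample is typically within a few points of the population value. In practice, Python's `random.sample` performs the same kind of draw without replacement; the explicit loop is written out only to mirror the description.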
Telephone surveys

Conducting a survey by phone saves money. It can also be done in less time. How do you select the sample? Phone numbers look like this:

Area code   Exchange   Bank   Digits
415         767        26     76

The Gallup poll in ’88 used a multistage cluster sample with area codes, exchanges, banks and digits as a hierarchy. The Gallup poll in ’92 was simpler and worked like this:
1. There are 4 time zones in the US. Each zone is divided into 3 types of areas: heavily, medium and lightly populated. This produces 12 sampling regions.
2. Numbers were sampled at random within each region.

Problems

Problem 1: A survey organization is planning an opinion survey of 2,500 people of voting age in the U.S. True or false, and explain: the organization will choose people to interview by taking a simple random sample.

This is false. Taking a simple random sample from a population of about 200 million voters is impractical. First, because a list of all the voters is not available. Second, because taking a simple random sample from such a list is a big problem in itself. Third, because interviewing 2,500 people scattered all around the map would be very costly.

Problem 2: A sample of Japanese-American residents in San Francisco is taken by considering the four most representative blocks in the Japanese area of the town and interviewing all the residents in those blocks. However, a comparison with Census data shows that the sample did not include a high enough proportion of Japanese with college degrees. How can this be explained?

This was not a good way to draw the sample, because you would expect people living in the more traditional areas to have very specific characteristics. In particular, it is likely that people with college degrees were living in more suburban neighborhoods.

Histograms

Histogram of the average length of stay in hospital

Information is available from 131 hospitals. We show a histogram of the average length of stay, measured in days, for each hospital.
The area of each block is proportional to the number of hospitals in the corresponding class interval. In this example all the intervals have the same length.

[Figure: histogram of the average length of stay; horizontal axis: length of stay (days), from 6 to 20]

There are 7 class intervals, corresponding to:
- 6 to 8 days
- 8 to 10 days
- 10 to 12 days
- 12 to 14 days
- 14 to 16 days
- 16 to 18 days
- 18 to 20 days

Note that the class that corresponds to 14 to 16 days is empty and that the class with the highest count of hospitals is the one of 8 to 10 days.

Drawing a histogram

The starting point of a histogram is a distribution table. Consider the distribution of families by income in the US in 1973.

Income level in $     Percent
0 – 1,000              1
1,000 – 2,000          2
2,000 – 3,000          3
3,000 – 4,000          4
4,000 – 5,000          5
5,000 – 6,000          5
6,000 – 7,000          5
7,000 – 10,000        15
10,000 – 15,000       26
15,000 – 25,000       26
25,000 – 50,000        8
50,000 and over        1

In this table class intervals include the left endpoint, but not the right endpoint. It is important to specify which of the endpoints are included in each class. Notice that in this case the class intervals do not have the same length.

Once the distribution table is available, the next step is to draw a horizontal axis specifying the class intervals. Then we draw the blocks, remembering that:

In a histogram, the areas of the blocks represent percentages.

So, it is a mistake to set the heights of the blocks equal to the percentages in the table. The area of a rectangle is height × width:

A = h × w

If we know that the percent is equal to the area, and we know the width of the class interval, we find the height by dividing both sides of the above equation by the width:

h = A / w

This means that the height has units of percent per unit of the horizontal axis (here, percent per $1,000).
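
The division of area by width can be sketched in Python using the 1973 income distribution given above. The variable and function names, and the choice to express the endpoints in thousands of dollars, are assumptions made for this illustration. The open-ended class "50,000 and over" is left out because it has no width.

```python
# Class intervals in thousands of dollars: (left endpoint, right endpoint, percent).
# Each interval includes its left endpoint but not its right endpoint.
income_classes = [
    (0, 1, 1), (1, 2, 2), (2, 3, 3), (3, 4, 4), (4, 5, 5), (5, 6, 5),
    (6, 7, 5), (7, 10, 15), (10, 15, 26), (15, 25, 26), (25, 50, 8),
]


def block_heights(classes):
    """Height of each block: percent (the area) divided by the interval width."""
    return [percent / (right - left) for left, right, percent in classes]


heights = block_heights(income_classes)

for (left, right, percent), h in zip(income_classes, heights):
    width = right - left
    print(f"{left:>2} to {right:>2} thousand $: "
          f"{percent:>2}% over width {width:>2} -> {h:.2f}% per $1000")
```

Multiplying each height by its width gives back the percentage in the table, which is a quick check that the areas, not the heights, carry the percentages.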
The table needed to calculate the heights of the blocks looks like this:

Income level in $     Percent   Width in $   Percent per $1000
0 – 1,000              1         1,000        1
1,000 – 2,000          2         1,000        2
2,000 – 3,000          3         1,000        3
3,000 – 4,000          4         1,000        4
4,000 – 5,000          5         1,000        5
5,000 – 6,000          5         1,000        5
6,000 – 7,000          5         1,000        5
7,000 – 10,000        15         3,000        5
10,000 – 15,000       26         5,000        5.2
15,000 – 25,000       26        10,000        2.6
25,000 – 50,000        8        25,000        0.32
50,000 and over        1

This is the resulting histogram. Notice that the class interval of incomes above $50,000 has been ignored.

[Figure: Distribution of family income in the US in 1973; horizontal axis: income in $1000, from 0 to 50; vertical axis: percent per $1000, from 0 to 5]

Vertical scale

What is the meaning of the vertical scale in a histogram? Remember that the areas of the blocks are proportional to the percents. A tall block means that a large share of the area accumulates over a small portion of the horizontal scale. This implies that the density, or concentration, of the data is high in the intervals where the height is large. In other words, the data are more crowded in those intervals.

Types of variables

A variable is a characteristic that may change from individual to individual in a population. For example, in a survey the questions "How old are you?", "What is your family’s size?", "What is your gender?" and "What is your marital status?" correspond to the variables age, family size, sex and marital status.

There are different types of variables. Each type is usually handled and analyzed differently. Variables can be classified as:
- Quantitative data. These correspond to observations measured on a numerical scale. They can be:
  - Discrete: the values can only differ by fixed amounts, as in family size. This is typical of counts.
  - Continuous: the differences in values can be arbitrarily small, as in age. This is typical of variables that measure physical quantities.
- Qualitative data. These correspond to observations classified in groups or categories, as in sex and marital status. Sometimes the groups have some ordering, as in the case of grades.
  Of particular importance are binary variables, which can take only two values.

Classify the following variables:
- Records of whether an electrical device is working or not.
- The depth of the snow pack at a monitoring station in the Sierras.
- The number of students in AMS 5.
- The final grade of a student in AMS 5.
- The State where a given car is registered.
- The ranking of a school within the State school system.
- The number of calls to 911 in a given month.