Political Science W4365, Design and Analysis of Sample Surveys
Columbia University, Spring 2013

Class meeting: Mon/Wed 10-11:30, Pupin 425
Section meeting: Time and place to be arranged
Instructor: Andrew Gelman
Teaching assistant: Tiffany Washburn

Course description: Survey sampling is central to modern social science. We discuss how to design, conduct, and analyze surveys, with a particular focus on public opinion polls in the United States.

Prerequisites: Basic statistics and regression analysis (for example, Pols 4911, Stat 2024 or 4315, Soc 4075, etc.).

Textbooks:
- Groves, R. M., Fowler, F. J., Couper, M. P., Lepkowski, J. M., Singer, E., and Tourangeau, R. (2009). Survey Methodology, second edition. Wiley.
- Lumley, T. (2010). Complex Surveys: A Guide to Analysis Using R. Wiley.
- Gelman, A., and Hill, J. (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press.
Also, readings for each week at http://www.stat.columbia.edu/~gelman/surveys.course/

Final exam: Last year’s final is at http://www.stat.columbia.edu/~gelman/surveys.course/final2012.pdf

Tentative syllabus:

Weeks 1-2: Statistical background

Class 1a: Analyzing survey data in R
Class 1b: Statistical inference and linear regression
Class 2a: Logistic regression
Class 2b: The challenge of estimating small effects

Topics:
Estimates of proportions
Estimates and standard errors for continuous parameters
Linear regression
Logistic regression
Translating regressions into real-world predictions
Computation in R: proportions and averages, linear and logistic regression, simple graphs, random sampling

Stories:
The 1/sqrt(N) story
55,000 residents desperately need your help!
Arsenic in drinking water in Bangladesh

Video:
Osorio, F., Gelman, A., and Leeman, L. (2012). Working with the General Social Survey in R. http://www.youtube.com/watch?v=mu2sEf12Eu4

Readings:
Gelman and Hill, chapters 2-5.
Gelman, A., Lee, D., Dorie, V., and Chan, V. (2011).
Statistics: What’s the Difference?, chapters 1-3.
Gelman, A., and Stern, H. S. (2006). The difference between “significant” and “not significant” is not itself statistically significant. American Statistician 60, 328-331.
Gelman, A. (2012). 1.5 million people were told that extreme conservatives are happier than political moderates. Approximately .0001 million Americans learned that the opposite is true. Statistical Modeling, Causal Inference, and Social Science blog, 14 Aug. http://andrewgelman.com/2012/08/1-5-million-people-were-told-that-extreme-conservatives-are-happier-than-political-moderates-approximately-0001-million-americans-learned-that-the-opposite-is-true/

Homework due beginning of class 3a (problems 1 and 2) and class 4a (problems 3 and 4):

1. Sample size calculation. In a survey of n people, half are asked if they support “the health care law recently passed by Congress” and half are asked if they support “the law known as Obamacare.” The goal is to estimate the effect of the wording on the proportion of Yes responses. How large must n be for the effect to be estimated within a standard error of 5 percentage points?

2. Linear regression. The file at http://www.stat.columbia.edu/~gelman/surveys.course/pew_research_center_june_elect_wknd_data.dta has data from Pew Research Center polls taken during the 2008 election campaign. You can read these data into R using the read.dta() function (after first loading the “foreign” package into R). For this homework set, ignore the survey weights. Fit a linear regression (using the lm() function in R) to predict political ideology (on a 5-point scale: –2 = very liberal, –1 = liberal, 0 = moderate, 1 = conservative, 2 = very conservative, with nonresponses coded as 0’s), given sex, age, and marital status. Use the display() function (after first loading the “arm” package) to display the result. In a short paragraph, describe the meaning of each coefficient in the fitted model.

3. Logistic regression.
Using this same survey, fit a logistic regression (using the glm() function in R) to predict whether a person is liberal (that is, responds “liberal” or “very liberal” to the ideology question, excluding respondents who do not respond to this question), given sex, age, and marital status. Use the display() function to display the result. In a short paragraph, describe the meaning of each coefficient in the fitted model.

4. Working with survey data in R. Using this same survey, compute the percentage of respondents in each state (excluding Alaska and Hawaii) who are liberal. Make the following three graphs, putting them into a single image, hwk1_4.png, using the following commands in R:

    png("hwk1_4.png", height=600, width=600)
    par(mfrow=c(2,2))
    ...
    dev.off()

(a) Plot estimated proportion liberal in each state vs. Obama's vote share in 2008 (data available at http://www.stat.columbia.edu/~gelman/surveys.course/2008ElectionResult.csv, readable in R using read.csv()), as a scatterplot using the two-letter state abbreviations (see the built-in vector state.abb in R).
(b) Plot estimated proportion liberal in each state vs. sample size in each state (again as a scatterplot using the two-letter state abbreviations).
(c) Map estimated proportion liberal using colors in a U.S. map.

Weeks 3-4: Missing data and adjusting for known differences between sample and population

Class 3a: Missing-data imputation
Class 3b: Survey nonresponse
Class 4a: Weighting and poststratification
Class 4b: Ratio and regression estimation

Topics:
Imputing missing values
Weighting
Poststratification

Stories:
Alcoholics Anonymous survey
Exit polls and election-night results

Readings:
Gelman and Hill, chapter 25.
Groves et al., chapter 6.
Lumley, chapters 5 and 9.
Gelman, A. (2006). Counting churchgoers. Statistical Modeling, Causal Inference, and Social Science blog, 11 Jul. http://andrewgelman.com/2006/07/counting_church/
Hadaway, C. K., Marler, P. L., and Chaves, M. (1993).
What the polls don’t show: A closer look at U.S. church attendance. American Sociological Review 58, 741-752.
Smith, T. W. (1983). The hidden 25 percent: An analysis of nonresponse on the 1980 General Social Survey. Public Opinion Quarterly 47, 386-404.
Gelman, A. (2007). Struggles with survey weighting and regression modeling (with discussion). Statistical Science 22, 153-164.
Lohr, S. (2007). Comment: Struggles with survey weighting and regression modeling. Statistical Science 22, 175-178.
Gelman, A. (2007). Rejoinder: Struggles with survey weighting and regression modeling. Statistical Science 22, 184-188.

Homework due beginning of class 5a (problems 1 and 2) and class 6a (problems 3 and 4):

1. Weighted analysis. Using the Pew surveys from the previous homework:
(a) Compute the weighted average proportion liberal in each state and plot vs. the raw average; this should be a square plot (in R, par(pty="s")) with identical scales on x and y axes, and each state indicated by its two-letter abbreviation.
(b) Using the “survey” package in R, fit a linear regression (using the svyglm() function in R) to predict political ideology, given sex, age, and marital status. Compare to the unweighted results.

2. Poststratification. A survey is taken of 100 undergraduates, 100 graduate students, and 100 continuing education students at a university. Assume a simple random sample within each group. Each student is asked to rate his or her satisfaction (on a 1–10 scale) with his or her experiences. Write the estimate and standard error of the average satisfaction of all the students at the university. Introduce notation as necessary for all the information needed to solve the problem.

3. Missing-data imputation.
Create a miniature version of the 2010 General Social Survey (http://www.thearda.com/Archive/Files/Codebooks/GSS10PAN_CB.asp), including the following variables: sex, age, ethnicity (use four categories), urban/suburban/rural, education (use five categories), political ideology (on a 7-point scale from “extremely liberal” to “extremely conservative”), and general happiness.
(a) Fit a logistic regression on whether respondents feel “not too happy,” given the other variables in the dataset. Display (using display()) the results for the logistic regression fit to the complete cases (this is the result if you just feed the data including NA’s into R).
(b) Impute the missing values using mi() in the “mi” package in R. Then take one of the completed datasets and fit and display a logistic regression as above.
(c) Repeat, this time imputing using aregImpute() in the “Hmisc” package.
(e) Briefly discuss the differences between the four inferences above.

4. Ratio and regression estimation. Exercise 5.3 from Lumley: Using the data from Wave 1 of the 1996 SIPP panel (see Lumley Figure 3.8):
(a) Estimate the ratio of population totals for monthly rent (“tmthrnt”) and total household income (“thtrninc”) over the whole population and over the subpopulation who pay rent.
(b) Compute the individual-level ratio, i.e., the proportion of household income paid in rent, and estimate the population mean over the whole population and the subpopulation who pay rent.

Weeks 5-6: Sampling and estimation

Class 5a: Simple and stratified random sampling
Class 5b: Cluster sampling with equal cluster sizes
Class 6a: Cluster sampling with unequal cluster sizes
Class 6b: Inference for regression coefficients

Topics:
Stratified sampling
Cluster sampling
Estimating population averages and totals
Estimating regression models

Stories:
Sampling names and addresses
Postal surveys

Readings:
Groves et al., chapters 3-4.
Lumley, chapters 1-6.
Gelman and Hill, chapters 7-8.
Carlin, J. B., Stevenson, M.
R., Roberts, I., Bennett, C. M., Gelman, A., and Nolan, T. (1997). Walking to school and traffic exposure in Australian children. Australian and New Zealand Journal of Public Health 21, 286-292.

Homework due beginning of class 7a (problems 1 and 2) and class 8a (problems 3 and 4):

1. Random sampling and regression. Sample 100 random data points x from the normal distribution with mean 10 and standard deviation 5. Then simulate 100 data points y from the model, y = 2 + 10x – x^2 + error, where the errors are normally distributed with mean 0 and standard deviation 1.
(a) Fit a linear regression to the data and fit a quadratic regression to the data. Display the fitted regressions (using the display() function).
(b) Make a scatterplot showing the data (using plot()) and the fitted linear and quadratic regression lines (using curve(a+b*x,add=TRUE) and curve(b0+b1*x+b2*x^2,add=TRUE)).

2. Cluster sampling. Suppose you have a library of 100 books and you want to estimate the frequency of the different words in this library. So you decide to take a random sample of 1000 words. Come up with a sampling scheme in which all words are equally likely to be selected (in proportion to their total number of appearances in the library).

3. Simulation and analysis of stratified sample. Write an R function to take a random subsample of the 2010 General Social Survey using regions of the country as strata.
(a) Perform a sample of size 100 with each stratum sampled in proportion to its population size (in this case, the “population” is just the full 2010 GSS). Use this subsample to estimate the proportion of people who favor a law which would require a person to obtain a police permit before he or she could buy a gun. Also compute the standard error for this estimate, first directly using the formula for the standard error of a stratified sample, then using the “survey” package in R. (These two standard errors should be identical.)
(b) Put step (a) above in a loop and do it 100 times.
Check that your estimate is unbiased and that its standard deviation is approximately equal to the average standard error computed in the 100 simulations.

4. Simulation and analysis of cluster sample. Write an R function to take a random subsample of the 2010 General Social Survey using occupations as clusters.
(a) Take a cluster sample in the following way: first sample 20 occupations at random, then sample 50% of the respondents from each sampled occupation. From this sample, estimate the proportion of people in the population who favor a law which would require a person to obtain a police permit before he or she could buy a gun. Compute the standard error of this estimate.
(b) Repeat (a), but this time taking the sample as follows: first sample 20 occupations at random, then sample 5 people from each sampled occupation (or, if there are fewer than 5 people with that occupation category, sample all of them). Again get an estimate and standard error for the gun control question.
(c) Repeat (a), but this time first sample 20 occupations with probability proportional to size, then sample 5 from each sampled occupation (or, if there are fewer than 5 people with that occupation category, sample all of them). Again get an estimate and standard error for the gun control question.

Weeks 7-8: Measurement

Class 7a: Survey interviewing
Class 7b: Challenges in survey measurement
Class 8a: Using surveys to answer questions in political science
Class 8b: Conducting a survey in the real world

Topics:
Observational measurement
Experimental measurement
Survey interviewing
Statistical models for measurement error
Item response theory

Stories:
Framing effects
The U-shaped pattern on happiness
Measuring gun use
Measuring religiosity

Readings:
Groves et al., chapters 7-9.
Tversky, A., and Kahneman, D. (1981). The framing of decisions and the psychology of choice. Science 211, 453-458.
Gelman, A. (2010). Age and happiness: The pattern isn't as clear as you might think.
Statistical Modeling, Causal Inference, and Social Science blog, 26 Dec. http://andrewgelman.com/2010/12/age_and_happine/
Frijters, P., and Beaton, T. (2008). The mystery of the U-shaped relationship between happiness and age. Working paper #26. National Centre for Econometric Research, Australia.
Blanchflower, D. G., and Oswald, A. (2008). Is well-being U-shaped over the life cycle? Social Science & Medicine 66, 1733-1749.
Stone, A. A., Schwartz, J. E., Broderick, J. E., and Deaton, A. (2010). A snapshot of the age distribution of psychological well-being in the United States. Proceedings of the National Academy of Sciences USA 107, 9985-9990.
Gelman, A. (2010). God, guns, and gaydar: The laws of probability push you to overestimate small groups. Statistical Modeling, Causal Inference, and Social Science blog, 12 Jul. http://andrewgelman.com/2010/07/god_guns_and_ga/
Hemenway, D. (1997). The myth of millions of annual self-defense gun uses: a case study of survey overestimates of rare events. Chance 10 (3), 6-10.

Homework due beginning of class 9a (problem 1) and class 10a (problem 2):

1. Survey interviewing. Design a survey form and try it out on five friends.

2. Survey measurement. Find a measurement effect in an existing survey.

Weeks 9-10: Surveys in political science

Class 9a: Voting
Class 9b: Public opinion
Class 10a: Political participation
Class 10b: Understanding and displaying data

Topics:
General Social Survey and the National Election Study
Mass-media opinion polls
Other surveys that are available for research
Manipulating data and performing simple analyses using R
Key ideas of sampling and surveys

Stories:
Uniform swing in opinions
Comparing health care attitudes in 1994 and 2009
What’s (not) the matter with Portugal?

Video:
Nanjiani, K. (2008). Cheese Heroin. http://www.youtube.com/watch?v=WVIC2gJTD9s

Readings:
Groves et al., chapters 1 and 2.
Page, B. I., and Shapiro, R. Y. (1982). Changes in Americans’ policy preferences, 1935-1979.
Public Opinion Quarterly 46, 24-42.
Page, B. I., Shapiro, R. Y., and Dempsey, G. R. (1987). What moves public opinion? American Political Science Review 81, 23-43.
Shapiro, R. Y., and Page, B. I. (1988). Foreign policy and the rational public. Journal of Conflict Resolution 32, 211-247.
Gelman, A., and King, G. (1993). Why are American Presidential election campaign polls so variable when votes are so predictable? British Journal of Political Science 23, 409-451.
Baldassarri, D., and Gelman, A. (2008). Partisans without constraint: Political polarization and trends in American public opinion. American Journal of Sociology 114, 408-486.
Gelman, A. (2009). Debunking the so-called Human Development Index of U.S. states. Statistical Modeling, Causal Inference, and Social Science blog, 20 May. http://andrewgelman.com/2009/05/20/debunking_the_s/
Gelman, A. (2008). Peeking behind the curtain, or, What’s (not) the matter with Portugal? Statistical Modeling, Causal Inference, and Social Science blog, 25 Mar. http://andrewgelman.com/2008/03/peeking_behind/
Gelman, A., and Cai, C. J. (2008). Should the Democrats move to the left on economic policy? Annals of Applied Statistics 2, 536-549.
Gelman, A. (2012). College football, voting, and the law of large numbers. Statistical Modeling, Causal Inference, and Social Science blog, 25 Oct. http://andrewgelman.com/2012/10/25/college-football-voting-and-the-law-of-large-numbers/

Homework due beginning of class 11a (problem 1) and class 12a (problem 2):

1. Survey responses. From 1984 through 2008 (and maybe in other years), the National Election Study asked attitudes on several issues, and also perceptions of the stances on these issues held by the major presidential candidates.
(For example, in 2004 these issues included the role of women, gun-control policy, government aid to African Americans, the level of spending that the government should undertake in the economy, the role of the government in providing an economic environment where there is job security, and the level at which the government should spend on defense. Each respondent was asked how he or she stood on these issues and where they would place George W. Bush and John Kerry.) It turns out that attitudes about the candidates’ views are strongly correlated with respondents’ own ideologies; see Figure 5 of Gelman and Cai (2008).
(a) Replicate Figure 5 of Gelman and Cai (2008) using the 2004 NES. Gelman and Cai (2008) had a serious coding error, so your results should be different from theirs.
(b) Discuss any difficulties you have, and compare your results to the published paper.

2. Social science. Write a three-page mini-paper addressing some interesting social science question using the NES or GSS. The topic and the analysis do not need to be deep, but they must be original, and you need to go beyond simple toplines and crosstabs.

Weeks 11-12: More elaborate statistical modeling

Class 11a: Bayesian inference
Class 11b: Ideal-point modeling
Class 12a: Multilevel regression and poststratification
Class 12b: Challenges in multilevel regression and poststratification

Topics:
Bayesian inference
Ideal-point modeling
Multilevel regression for stratified and cluster sampling
Multilevel regression and poststratification

Stories:
Voting
Health care
Gay rights

Readings:
Gelman and Hill, chapters 11-12.
Gelman, A. (2006). Multilevel modeling: What it can and cannot do. Technometrics 48, 432-435.
Gelman, A. (2005). Two-stage regression and multilevel modeling: a commentary. Political Analysis 13, 459-461.
Gelman, A. (2012). Is it meaningful to talk about a probability of “65.7%” that Obama will win the election? Statistical Modeling, Causal Inference, and Social Science blog, 22 Oct.
http://andrewgelman.com/2012/10/is-it-meaningful-to-talk-about-a-probability-of-65-7-that-obama-will-win-the-election/
Gelman, A., and Lock, K. (2010). Bayesian combination of state polls and election forecasts. Political Analysis 18, 337-348.
Gelman, A., Lee, D., and Ghitza, Y. (2010). Public opinion on health care reform. The Forum 8 (1), article 8.
Shapiro, R. Y., and Arrow, S. A. (2009). Support for health care reform: Is public opinion more favorable for Obama than it was for Clinton in 1994?
Lax, J., and Phillips, J. (2009). How should we estimate public opinion in the states? American Journal of Political Science 53, 107-121.
Lax, J., and Phillips, J. (2009). Gay rights in the states: Public opinion and policy responsiveness. American Political Science Review 103, 367-386.
Gelman, A., and Su, Y. S. (2011). Public opinion on school vouchers.
Ghitza, Y., and Gelman, A. (2013). Deep interactions with MRP: Presidential turnout and voting patterns among small electoral subgroups. American Journal of Political Science.

Homework due beginning of class 13a (problems 1 and 2) and class 14a (problems 3 and 4):

1. Bayesian inference. From a survey of 500 people, you estimate the proportion who support candidate A in the upcoming election to be 60%. From a forecast (not using this poll) you get a prediction that candidate A will win 51% of the vote. Let X be the standard error of this forecast. Further suppose that you estimate the nonsampling error of this poll to be equal to the sampling error.
(a) Suppose that, given the above information, your Bayesian forecast is that A will receive 54% of the vote. What is X, and what is the standard error of your Bayesian forecast?
(b) What is your Bayesian probability that candidate A will win the election?

2. Ideal-point modeling.
You will create a measure of economic ideology using the following questions from the 2000 Annenberg survey: Are tax rates a problem (CBB01), Favor cutting taxes or strengthening social security (CBB05), Federal government should reduce the top tax rate (CBB10), Federal government should adopt flat tax (CBB13), Federal government should spend more on social security (CBC01), Favor investing social security in stock market (CBC05), Is poverty a problem (CBP01), Federal government should reduce income differences (CBP02), Federal government should spend more on aid to mothers with young children (CBP03), Federal government should expend effort to eliminate many business regulations (CBT01). Fit a hierarchical logistic regression to estimate ideal points for individuals and survey questions.
(a) Display the estimated ideal points and standard errors of the survey questions (listing the questions in order of their estimated ideal points).
(b) Display the distribution of estimated ideal points of the survey respondents. On this same graph, display the distributions for Democrats, independents, and Republicans.

3. Multilevel modeling. From the Pollster data, estimate a time series of support for Obama and Romney, adjusting for house effects and then smoothing the curve using some function such as lowess. Compare to the smoothed average of the unadjusted approval numbers from this series and comment on any differences.

4. Multilevel regression and poststratification. Download the cumulative National Election Study.
(a) Fit a multilevel logistic regression estimating support for gun control given state, year, sex, and ethnicity (white/black/Hispanic/other). Use the display() function in R to display the fitted model. Explain the output in a brief paragraph.
(b) Using your model, get estimates of the proportion of people who support gun control, for all 8 demographic groups in each state (excluding Alaska and Hawaii) for the year 2012.
Using the 2010 census, poststratify to get an estimate for each state.
(c) Make the following five graphs on a 3x2 grid:
(i) a map of estimated gun control support by state in 2012;
(ii) a plot of estimated gun control support vs. Obama vote share in 2012 (indicating each state by its two-letter abbreviation);
(iii) a plot of estimated gun control support in 2012 vs. the raw proportion of respondents in the state from 2012 who supported gun control;
(iv) a plot of estimated gun control support in 2012 vs. the raw proportion of respondents in the state who supported gun control, pooling data from all years of NES;
(v) a plot of estimated gun control support in 2012 vs. the state-level “random effects” from the fitted multilevel model.

Weeks 13-14: Hard-to-reach populations

Class 13a: Low response rates in U.S. surveys
Class 13b: Surveys in less-developed countries
Class 14a: Network sampling
Class 14b: Review

Topics:
Callbacks
Capture-recapture
Respondent-driven sampling

Stories:
Friendsense
How many X’s do you know?
Polarization and perceived polarization
The Iraq death surveys
Census adjustment
“Millionaires for McCain, Billionaires for Obama”

Readings:
U.S. Census Bureau (2012). Census Bureau releases estimates of undercount and overcount in the 2010 Census. http://www.census.gov/newsroom/releases/archives/2010_census/cb12-95.html
Gelman, A. (2008). Political attitudes of the super-rich. Red State Blue State blog, 2 Nov. http://redbluerichpoor.com/blog/2008/11/political-attitudes-of-the-super-rich/
Page, B. I., Bartels, L. M., and Seawright, J. (2013). Democracy and the policy preferences of wealthy Americans. Perspectives on Politics 11, 51-73.
Goel, S., Mason, W., and Watts, D. J. (2010). Real and perceived attitude agreement in social networks. Journal of Personality and Social Psychology.
Hampton, K. N., Goulet, L. S., Rainie, L., and Purcell, K. (2011). Social networking sites and our lives. Pew Research Center report, 16 June.
McCormick, T.
H., Salganik, M. J., and Zheng, T. (2010). How many people do you know?: Efficiently estimating personal network size. Journal of the American Statistical Association 105, 59-70.
Heckathorn, D. D. (1997). Respondent-driven sampling: A new approach to the study of hidden populations. Social Problems 44, 174-199.
Goel, S., and Salganik, M. J. (2010). Assessing respondent-driven sampling. Proceedings of the National Academy of Sciences USA 107, 6743-6747.
Spagat, M. (2009). The reliability of cluster surveys of conflict mortality: Violent deaths and non-violent deaths. Presentation given at the conference, International Conference on Recording and Estimation of Casualties, Carnegie Mellon University and University of Pittsburgh, 23-24 Oct.
Gelman, A. (2010). Ethical and data-integrity problems in a study of mortality in Iraq. Statistical Modeling, Causal Inference, and Social Science blog, 27 Apr. http://andrewgelman.com/2010/04/ethical_and_dat_1/
Spagat, M. (2010). Ethical and data-integrity problems in the second Lancet survey of mortality in Iraq. Defence and Peace Economics 21, 1-41.
Rothschild, D., and Wolfers, J. (2011). Forecasting elections: Voter intentions versus expectations. http://assets.wharton.upenn.edu/~rothscdm/RothschildExpectations.pdf