PS 170A: Introductory Statistics for Political Science and Public Policy
Lecture Notes (Part 6)
Langche Zeng

Simple Correlation and Regression Analysis

1. Introduction
• Things exist in relationships. Nothing can be well understood in isolation.
• Correlation and regression analysis are tools for studying relationships between two or more variables.
• Correlation: is there a (linear) relationship? How strong is it? x and y are treated symmetrically.
• Regression: can x help explain y? What is the relationship in mathematical form? How do we use that form to predict y?
• Examples? (The lifeexp data is one: lexp as a function of safewater.)
• Linear relationship: pp.257-259. Interpretation of α and β.
• Example of a nonlinear relationship: p.260, fig.9.5.

2. Simple Correlation Analysis
• Start by examining the scatter plot, e.g.
  sysuse lifeexp
  scatter lexp safewater
  scatter lexp gnppc
• The Pearson coefficient of correlation, r: a numerical measure of the direction and strength of the linear relationship between two quantitative variables.
• Formula: p.270. Stata: correlate x1 x2
• To test H0: r = 0 (no linear relationship in the population): $t = r \sqrt{\frac{n-2}{1-r^2}}$ has a t distribution with n-2 degrees of freedom.
  Stata, pairwise, with significance test of r = 0: pwcorr x1 x2 x3, sig
  Try it on the lifeexp data.
• Properties (p.271):
  falls in [-1,1]
  meaning of sign and magnitude
  symmetric in x and y
  fig.9.11, p.272: example of a perfect non-linear relationship with linear correlation 0.
• r² is called the coefficient of determination. It measures the proportion of the variation in y “explained” by x.
• Formula and interpretation: p.274, fig.9.13. r² can be generalized to the multiple regression case. Software routinely reports it.

3. Simple Regression Analysis
• What is the form of the linear relationship? Once we know the form, we can do a lot of useful things (interpretation, prediction).
• How do we best estimate the relationship with a straight line? How do we fit a straight line to the scatter plot? According to what method/principle?
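The correlation coefficient and its t test from section 2 can be sketched numerically. This is a minimal illustration in Python rather than Stata; the five data points are hypothetical (they are not from the lifeexp data) and the function names are made up for this sketch.

```python
import math

def pearson_r(x, y):
    """Pearson correlation: co-variation of x and y scaled by their spreads."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    syy = sum((b - ybar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def t_statistic(r, n):
    """t = r * sqrt((n - 2) / (1 - r^2)); compare with a t distribution on n - 2 df."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

# Hypothetical data, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
t = t_statistic(r, len(x))
print(f"r = {r:.4f}, r^2 = {r ** 2:.2f}, t = {t:.4f}, df = {len(x) - 2}")
```

Swapping the roles of x and y leaves r unchanged, which is the symmetry property listed in section 2.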
• Real data are rarely (if ever) represented exactly by a straight line. There are errors involved: $y_i = \alpha + \beta x_i + \epsilon_i$
• In estimating a linear model, we try to find the “best” guess of α and β.
• “Least squares” is one estimation principle that optimizes things in a certain sense: it minimizes the sum of squared errors, $\min \sum_i \epsilon_i^2$. Draw figure.
• Optimization is a mathematical procedure. It results in “estimators” for the coefficients. An estimator is a formula that translates input data into an estimated value for a quantity of interest.
• Formula for $\hat{\alpha}$ and $\hat{\beta}$: p.261. For $\hat{\beta}$, it is the same as $\frac{\mathrm{cov}(x,y)}{V(x)}$.
  Messy looking, but easy to get from statistical packages.
  Stata: regress lexp safewater
  R: ?lm for details
• Goodness of fit: R²
  Verify that R² is “Model SS/Total SS” in the Stata output. R² falls between 0 and 1.
  MSS is $\sum_i (\hat{Y}_i - \bar{Y})^2$
  TSS is $\sum_i (Y_i - \bar{Y})^2$
  ESS (or SSE) is $\sum_i (Y_i - \hat{Y}_i)^2$
  Show in figure: the distance from a point $Y_i$ to the mean of Y can be decomposed into two parts: the distance from $Y_i$ to the model prediction $\hat{Y}_i$ (residual error), and that from the model prediction to $\bar{Y}$.
• $\sqrt{\frac{SSE}{n-2}}$ is the “Root MSE” in the Stata output. It is an estimate of the standard deviation of the residuals (which have mean 0), and gives an idea of the average size of the residual errors. It provides an estimate for σ, the meaning of which is illustrated in fig.9.8, p.267 for a relationship in the population (when the parameter values are known).
• $\hat{y} = \hat{\alpha} + \hat{\beta}x$ is called the prediction equation.
  e.g., lifeexp data; also example 9.4, p.261.
  Stata:
  predict ybar
  list ybar lexp safewater
  se of prediction: predict ybarse, stdp
  R: predict.lm(lm.out, se.fit = FALSE), or lm.out$fitted.values, for the fitted values; se.fit = TRUE also returns the standard errors.
• Effects of outliers: fig.9.6, p.262; fig.9.17, p.285
• Inferences for the coefficients: model assumptions, p.276.
  assumption 1: the sample is representative of the population
  assumption 2: the model is correct
  assumption 3: homoscedasticity.
  fig.9.9, p.269
  assumption 4: normality (not critical for large N).
• $\hat{\alpha}$ and $\hat{\beta}$ as random variables: different sample data give different estimated coefficient values. We can think about how the estimated values are distributed over possible samples.
• And we can discuss properties of the estimators: unbiasedness, efficiency, etc.
• Can show that the standardized $\hat{\alpha}$ and $\hat{\beta}$ (estimate minus true value, divided by its standard error) follow t distributions with n − 2 df. See the Stata output for t-values and p-values for testing the null hypothesis, and the CIs constructed using these distributions. p.277: test of β = 0
• Violations of assumptions: reality check (later chapters)

Multivariate Relationships and Multiple Regression

1. Association and Causation
• Real world relationships usually involve multiple variables. e.g. what are possible predictors of college GPA? Voting behavior?
• Causal effect inference is central to social science research.
• Association does not equal causation. e.g. fig.10.1, p.305: causal effect of height on math achievement? The relationship disappears when controlling for grade level/age: a “spurious association” produced by a common cause.
• Other types of multivariate relationships: table 10.5, p.315
  chain: father’s education → son’s education → son’s income
  interaction: the effect of x1 on y depends on the value of x2. e.g., the effect of education on income may depend on race or gender.
  direct and indirect effects: e.g., gender on party ID, both directly and through ideology.
  spurious non-association: suppose education → income within the same age group. Suppose age is positively related to income but negatively related to education. Then without controlling for age the relationship between education and income may not show up.
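The spurious non-association case can be made concrete with a small numerical sketch. The Python below uses hypothetical numbers (education in years, income in thousands) chosen so that income rises with education within each age group, while the older group has more income and less education. The data and names are assumptions for illustration, not from the textbook.

```python
def slope(x, y):
    """Least-squares slope of y on x: Sxy / Sxx."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    return sxy / sxx

# Hypothetical data: (education in years, income in thousands)
young_educ, young_income = [15, 16, 17], [37, 40, 43]
old_educ, old_income = [11, 12, 13], [39, 42, 45]

within_young = slope(young_educ, young_income)  # holding age group fixed
within_old = slope(old_educ, old_income)        # holding age group fixed
pooled = slope(young_educ + old_educ, young_income + old_income)  # ignoring age

print("slope within young group:", within_young)
print("slope within old group:  ", within_old)
print("slope ignoring age:      ", pooled)
```

Within each age group an extra year of education goes with 3 more units of income, yet the pooled slope is 0: the education effect only shows up once age is held constant.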
• For a relationship to be considered causal, it needs to satisfy these necessary conditions:
  a) association
  b) appropriate time order (e.g., gender/race are causally prior to behavioral variables)
  c) elimination of alternative explanations
• To achieve c), we frequently need to “control” the influence of other variables by holding their values constant. e.g. controlling for grade level in the height/math score example. Grade level is called a “control variable” in this example. Holding its value constant is a case of “statistical control”. In regression analysis, control variables enter the model as independent variables, along with the key causal variable of interest.

2. Multiple regression analysis
• For k > 1 explanatory variables. e.g. state level violent crime rate as a function of “poverty” (percentage of the state population living in poverty) and “metro” (percentage of the state population living in a metropolitan area, which could be a common cause of crime rate and poverty).
  Model: $y_i = \alpha + \beta_1 x_{1i} + \ldots + \beta_k x_{ki} + \epsilon_i$
  Or equivalently: $E(Y_i) = \alpha + \beta_1 x_{1i} + \ldots + \beta_k x_{ki}$, $i = 1, 2, \ldots, n$
  where the $\epsilon_i$’s are assumed to be independent and distributed $N(0, \sigma)$, among other things. (Same as in the simple linear model.)
• Meaning of $\beta_k$: a one unit increase in $x_k$ is associated with a $\beta_k$ unit increase in E(y), holding all other x’s constant. If $\beta_k$ is 0 in the population, then there is no relationship between $x_k$ and y (for any i). $\beta_k$ is called the “marginal effect” of $x_k$ on E(y).
  Meaning of α: E(y) when all x = 0.
• How to find the “best” α and β’s? According to the same OLS principle. Now we are fitting a “plane” in k+1 dimensional space, rather than a line in 2 dimensional space. (Imagine k=2.) We minimize the sum of squared errors from the observed data points to the regression plane. See fig.11.1, p.322.
• Property of OLS estimators (under the “classical linear model assumptions”): BLUE (best linear unbiased estimator).
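To make the OLS principle for k = 2 concrete, here is a minimal Python sketch that solves the normal equations $(X'X)b = X'y$ by Gauss-Jordan elimination. The data are hypothetical and constructed to lie exactly on the plane y = 1 + 2*x1 + 3*x2, so the fit should recover those coefficients. This illustrates the principle only; it is not how Stata or R compute the estimates internally.

```python
def ols(X, y):
    """Least-squares coefficients [alpha, beta_1, ..., beta_k] via the normal equations."""
    rows = [[1.0] + list(r) for r in X]  # prepend a column of 1s for the intercept
    p = len(rows[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(p):
        piv = max(range(c, p), key=lambda i: abs(A[i][c]))
        A[c], A[piv] = A[piv], A[c]
        pivot = A[c][c]
        A[c] = [v / pivot for v in A[c]]
        for i in range(p):
            if i != c:
                f = A[i][c]
                A[i] = [vi - f * vc for vi, vc in zip(A[i], A[c])]
    return [A[i][p] for i in range(p)]

# Hypothetical data lying exactly on y = 1 + 2*x1 + 3*x2
X = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in X]

alpha, b1, b2 = ols(X, y)
print(f"alpha = {alpha:.3f}, beta1 = {b1:.3f}, beta2 = {b2:.3f}")

# Marginal effect of x1: raising x1 by one unit with x2 held fixed changes the prediction by beta1
pred_low = alpha + b1 * 2 + b2 * 3
pred_high = alpha + b1 * 3 + b2 * 3
```

With real data the points do not lie exactly on a plane, and the same procedure returns the plane with the smallest sum of squared residuals.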
• Example: Violent Crime Rates = f(poverty, metro)
  The estimated regression model is
  E(Crime) = -495.87 + 33.16*Poverty + 9.57*Metro
  Mean(Poverty)=13.9%, mean(Metro)=66.0%. For all states with Poverty=13.9% and Metro=66%, we predict the average violent crime rate to be
  -495.87 + 33.16*13.9 + 9.57*66 ≈ 597 (per 100,000 population)
  Prediction for an individual state is the same, but with higher uncertainty due to the random error term.
  The marginal effect of “poverty” on the average crime rate, holding “metro” constant (at any value), is 33.16: every 1 percentage point increase in “poverty” corresponds to 33.16 more cases of violent crime per 100k population.
• Standardized coefficients: to compare the relative effects of different independent variables, we need standardized coefficients. The original coefficients depend on the units of measurement.
  The standardized coefficient for $x_k$, obtained by multiplying the estimated coefficient by $s_{x_k}/s_y$, measures the change in y, in standard deviations, for a one standard deviation change in $x_k$.
  The standardized coefficients can also be obtained by using the z-scores of the original variables in the regression model.
  Stata: regress y x1 x2 x3, beta
• Goodness of Fit: the Coefficient of Determination (R²)
  Measures how well the regression model fits the data. R² measures how much of the variation in the values of the response variable (y) is explained by the regression model (i.e., by all the independent variables collectively).
  The distance between an observed Y and the mean of Y in the data set can be decomposed into two parts: from Y to E(Y) given by the regression model, and from E(Y) to the mean of all Y.
  R² is defined as MSS/TSS, or 1-ESS/TSS (p.332).
  The higher the R², the better the fit. Adding more independent variables to the model never decreases R²; Stata also reports the “adjusted R²” to account for model complexity.
  Ultimately, goodness of fit measures should not be used as the model selection criterion, as a model could over-fit the data. Compare out-of-sample prediction performance instead.
• Checking the functional form assumption: partial regression plot (also called added-variable plot)
  Plots the relationship between y and $x_k$ after removing the effects of the other predictors: residual from “reg y z” against residual from “reg x z”, where z denotes the set of all other independent variables.
  Stata: avplots
• Residual plots: residuals against fitted values: “rvfplot”. Look at the pattern of the residuals for signs that assumptions are violated.
  Test of heteroscedasticity: estat hettest
• Multicollinearity: when there is relatively strong correlation among some of the $x_k$’s, some of the individual variables may not add much predictive power. Correlation also makes interpretation of results difficult, since “holding all others constant” while moving one variable is unrealistic when the variables are strongly correlated. Try to use predictors with weak correlations if possible.
• Hypothesis Testing: Is There a Relationship?
  The estimated intercept and slopes are based on one particular sample. What are the corresponding population parameters?
  a) Testing a population slope being 0 (intercept similar, but less interesting), i.e., testing the hypothesis that there is no relationship between some x and y:
  Recall the logic of hypothesis tests. Under the null hypothesis, $\hat{\beta}_k / se(\hat{\beta}_k)$ can be shown to follow the “Student-t” distribution (assuming unknown σ) with n-k-1 degrees of freedom. Software routinely reports the p-values from the test. (See Stata output)
  b) We can also test the “global” hypothesis that all $\beta_k$’s are simultaneously 0, i.e., our independent variables as a group have no significant effect on our dependent variable.
  This is done using the so-called “F-test”; p-values for F(k, n-k-1) are routinely reported by software. Rejecting the null means: at least one x “matters”. Formula for F: p.336, figure 11.9. See the F statistic in the Stata output.
  c) Testing a subgroup of parameters being 0: the global F-test is a special case. Formula for F: p.345 bottom. Stata: test P=M=0 (for example)
• Beware...
  OLS is not robust to outliers (regardless of the number of x variables).
  Extrapolation beyond the observed data region is dangerous.
  Correlation does not imply causation.
  Properties of OLS estimators hold only if the model assumptions are satisfied.
• Modeling Interaction Effects: Special Case of Non-linearity
  In the linear additive model, the marginal effect of some x on E(y) is constant, independent of the values of the other x’s in the model. This is generally not true in a non-linear model.
  The interaction effect model is a special case of a non-linear model. Simple example:
  $E(Y) = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2$
  In this model, the marginal effect of $x_1$ is $\beta_1 + \beta_3 x_2$, so it depends on the value of $x_2$.
  e.g. x1 = Gender (female=1), x2 = Education (high=1), Y = pro-choice abortion opinion (higher score → stronger pro-choice views).
  Estimated model (showing reversed gender gap):
  $E(Y) = 4.04 - .55 x_1 + 1.09 x_2 + 1.16 x_1 x_2$
  male/low educ: 4.04
  female/low educ: 4.04 - .55 = 3.49
  male/high educ: 4.04 + 1.09 = 5.13
  female/high educ: 4.04 - .55 + 1.09 + 1.16 = 5.74
  Can write out the prediction model for separate groups. The slope of x1, for example (as well as the intercept of the model), differs when x2 takes different values.
  Another example: p.342, fig.11.10. Try
  use http://dss.ucsd.edu/~lazeng/ps170/table9.1.dta
  gen PM=P*M
  reg VR P M PM
  What is the marginal effect of P (M) when M (P) is at the mean?
• Dummy X Variables
  Sometimes one or more of our independent variables may be categorical variables, such as gender or race. Multiple valued categorical variables can be recoded into a set of binary “dummy” variables taking values 0/1. e.g.
  White/Black/Hispanic/Asian (Why don’t we want to use the multiple valued variable “race” in the regression model, if it’s coded say 1,2,3,4?)
  If there are m categories, we use m-1 dummies in the model, since the last one does not add any information: knowing the values of “White”, “Black”, and “Hispanic” we can infer the value of “Asian” (assuming these exhaust the racial categories in the data). Similarly, for “gender” we only need one variable, not two.
  Dummy variables change the intercept and/or the slope of the relationships for the different groups represented by the dummy.
  The most natural way of interpreting the effect of a dummy variable is to see its effect on Y as it goes from 0 to 1.
  If Y is a dummy variable, the standard linear regression model doesn’t apply. We’ll need to use models for binary distributions, such as logit or probit, to which we turn next.

Basics of Logit/Probit Models
• Binary dv, model:
  $Y^* = X\beta + \epsilon$
  $Y = 1$ if $Y^* > 0$
  Assuming a probability distribution for ε that is symmetric around 0,
  $P(Y = 1|X) = P(Y^* > 0|X) = P(\epsilon > -X\beta) = P(\epsilon \le X\beta) = \Phi(X\beta)$ (probit) or $\frac{1}{1+e^{-X\beta}}$ (logit)
  Graph: fig.15.1, p.484
• What can be learned from the Stata output, and what can’t:
  use http://dss.ucsd.edu/~lazeng/ps170/class.dta
  gen abortion=(ab=="y")
  gen lifeafter=(ld=="y")
  gen gender=(ge=="f")
  logit abortion gender lifeafter pi
• Predicted probabilities, first differences, etc., with uncertainty/CI:
  “findit clarify” to find and install clarify
  estsimp logit abortion gender lifeafter pi
  setx gender 1 lifeafter 1
  simqi, fd(pr) changex(pi 1 7)
  mean diff in pr=-.6751331, sd=.2051763, 95% CI=(-.939707, -.155823)
  (“help estsimp” and so on)
  R: glm, Zelig

Relationships between categorical variables
• When both dependent and independent variables are categorical, data can be presented in a contingency table. e.g. table 8.1, p.222: Party ID and gender.
• Non-parametric analysis of the relationship: is there an association?
  What’s the (cell) pattern of the association, and what’s the strength of the association? No distributional assumptions on “error” terms. Just working with the actual observed raw data.
• Independence: the distribution of Y is independent of X:
  $P(Y|X = x_1) = P(Y|X = x_2) = \ldots = P(Y|X = x_r) = P(Y)$
  e.g., table 8.3, p.224. Party ID distribution independent of race. P(Y=D)=.44 whatever the race value is. Similar for other values of Y.
  In contrast, table 8.2, p.222 might be evidence for dependence/association. “Might be” because this is sample data, thus there is uncertainty about the population relationship.
• Chi-square test of independence (for nominal data):
  Under the null hypothesis that there is no association, the distribution of Y should be independent of the values of X. So knowing the total number of cases, and the total number for each value of Y, we can write down the expected frequency of observed Y values.
  table 8.4, p.225: under H0,
  $P(Y = D) = N_D/N = 959/2771$,
  $P(Y = R) = N_R/N = 821/2771$,
  $P(Y = I) = N_I/N = 991/2771$.
  These probabilities should stay the same whichever “Gender” value we are looking at. So for example the expected frequency for the “Female” and “Democrat” cell should be x such that x/1511 = 959/2771, → x = (959/2771) * 1511 ≈ 522.9
  Expected frequencies for the other cells are filled in similarly.
  Now we can compare the observed frequencies with the expected. If H0 is true, we expect that they don’t differ much. The sum of normalized squared differences, $\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}$ over all cells, supplies the χ² test statistic on p.225, with degrees of freedom (r − 1) * (c − 1) (r: no. of rows; c: no. of cols), assuming N is reasonably large: expected frequency in each cell > 5 (if not satisfied, use Fisher’s exact test.)
  Density of χ² distributions: fig.8.2, p.226.
  Summary of the Chi-square test: table 8.5, p.228.
• Cell residual pattern: how do the data differ from the expected pattern?
  Standardized residuals (box, p.230). Example data: table 8.8, p.231.
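The expected-frequency step of the chi-square test can be sketched in Python. The margins below (959, 991, 821 and 1511 out of 2771) are the ones quoted in these notes, with the male total inferred as 2771 - 1511. The observed 2x2 counts used to demonstrate the statistic are hypothetical, since the cell counts of table 8.4 are not reproduced here, and the function names are made up for this sketch.

```python
def expected_counts(row_totals, col_totals):
    """Expected cell counts under independence: row total * column total / N."""
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def chi_square(observed):
    """Chi-square statistic and degrees of freedom for an r x c table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    expected = expected_counts(row_totals, col_totals)
    stat = sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(observed, expected)
               for o, e in zip(obs_row, exp_row))
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df

# Margins quoted in the notes; male total inferred as N - 1511
gender_totals = [1511, 2771 - 1511]   # Female, Male
party_totals = [959, 991, 821]        # Democrat, Independent, Republican
expected = expected_counts(gender_totals, party_totals)
print(f"expected Female/Democrat count: {expected[0][0]:.1f}")

# Hypothetical 2x2 table of observed counts, for illustration only
toy = [[10, 20], [20, 10]]
stat, df = chi_square(toy)
print(f"chi-square = {stat:.3f} on {df} df")
```

In the toy table every expected count is 15, so each cell contributes 25/15 to the statistic.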
• Strength of association
  The chi-square test does not measure the strength of an association; it only tests for its existence.
  For nominal data, it is hard to summarize the strength of association with a single number for tables larger than 2x2: there are too many possible association patterns.
  2x2 tables: can look at the difference in proportions, with magnitude in [0,1]. e.g., table 8.11, p.234.
  The general idea is to look at how different P(Y) can be for different X values. So distance measures could apply.
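For a 2x2 table, the difference in proportions can be sketched as follows. The counts are hypothetical and the function name is made up for this illustration.

```python
def difference_in_proportions(table):
    """table[i] = [count with Y = yes, count with Y = no] for group i of X."""
    p = [row[0] / sum(row) for row in table]
    return p[0] - p[1]

# Hypothetical counts: 30% "yes" in group 1, 60% "yes" in group 2
toy = [[30, 70], [60, 40]]
diff = difference_in_proportions(toy)
print(f"difference in proportions: {diff:.2f} (magnitude {abs(diff):.2f})")

# Extremes: identical distributions give 0, complete separation gives magnitude 1
no_association = difference_in_proportions([[20, 30], [20, 30]])
perfect_association = difference_in_proportions([[50, 0], [0, 50]])
```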