Writing assignment 3 by Christina Tran
Christina Tran
California State University, Fullerton
Spring 2015
Math 437
3.3-3.5 Summary

3.3.1 Qualitative Predictors

Not all data are quantitative; some predictors are qualitative, such as gender.

Predictors with Only Two Levels

Qualitative predictors are also known as factors. If a factor has two levels (possible values), then incorporating it into a regression is simple. We create a dummy variable that takes on two possible numerical values, 0 and 1. For example, we can write

    xi = 1 if the ith person is female, 0 if the ith person is male.

We can use this variable as a predictor in the regression model:

    yi = β0 + β1 xi + εi = β0 + β1 + εi if the ith person is female, β0 + εi if the ith person is male.

So β0 can be interpreted as the average credit card balance among males, β0 + β1 as the average credit card balance among females, and β1 as the average difference in credit card balance between females and males. Alternatively, we can create a dummy variable with xi = 1 if the ith person is female and xi = -1 if the ith person is male. Then

    yi = β0 + β1 xi + εi = β0 + β1 + εi if the ith person is female, β0 − β1 + εi if the ith person is male.

Here β0 is the overall average credit card balance (halfway between the male average and the female average), and β1 is the amount by which females are above that overall average and males are below it (half the difference between the female and male averages). Both codings give the same fitted values; the only difference is in the interpretation of the coefficients.

Qualitative Predictors with More than Two Levels

For a qualitative predictor with more than two levels, we create additional dummy variables. Take ethnicity, with levels Asian, Caucasian, and African American. Let x1 = 1 if the ith person is Asian (0 otherwise) and x2 = 1 if the ith person is Caucasian (0 otherwise). Then the model is

    yi = β0 + β1 x1 + β2 x2 + εi = β0 + β1 + εi if the ith person is Asian, β0 + β2 + εi if the ith person is Caucasian, β0 + εi if the ith person is African American.

Here β0 is the average credit card balance for African Americans, β1 the difference in average balance between Asians and African Americans, and β2 the difference between Caucasians and African Americans. There will always be one fewer dummy variable than the number of levels. The level with no dummy variable - African American in this example - is called the baseline. If the p-values associated with the ethnicity dummy variables are high, there is no statistical evidence of a real difference in credit card balance between the ethnicities.
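As a concrete illustration of the dummy coding described above, here is a minimal Python sketch on a small made-up data set (the variables balance, gender, and ethnicity are hypothetical stand-ins for the Credit data discussed in the text). The statsmodels formula interface builds the dummy variables automatically, and the omitted level becomes the baseline absorbed into the intercept.

    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical mini data set standing in for the Credit data
    df = pd.DataFrame({
        "balance":   [310, 1200, 880, 640, 1510, 970, 420, 1130],
        "gender":    ["male", "female", "female", "male",
                      "female", "male", "male", "female"],
        "ethnicity": ["Asian", "Caucasian", "African American", "Asian",
                      "Caucasian", "African American", "Caucasian", "Asian"],
    })

    # Two-level factor: one 0/1 dummy variable is created for gender;
    # the intercept is the average balance of the omitted (baseline) level.
    fit_gender = smf.ols("balance ~ C(gender)", data=df).fit()
    print(fit_gender.params)

    # Three-level factor: two dummy variables are created, one fewer than
    # the number of levels, and the baseline is absorbed into the intercept.
    fit_eth = smf.ols("balance ~ C(ethnicity)", data=df).fit()
    print(fit_eth.params)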
3.3.2 Extensions of the Linear Model

The models mentioned before assume an additive and linear relationship between the predictors and the response; this is a huge assumption we may not wish to make! The additive assumption means that the effect of changes in a predictor Xj on the response Y is independent of the values of the other predictors, and the linear assumption means that the change in Y associated with a one-unit change in Xj is constant regardless of the value of Xj.

Removing the Additive Assumption

When spreading an amount across several predictors produces a larger increase in the response than allocating the entire amount to a single predictor, this is called a synergy effect in marketing and an interaction effect in statistics. Consider Y = β0 + β1 X1 + β2 X2 + ε. According to this model, if we increase X1 by one unit, then Y will increase by an average of β1 units, and the presence of X2 does not alter this statement: regardless of X2, a one-unit increase in X1 leads to a β1-unit increase in Y. One way of extending this model to allow for interaction effects is to include a third predictor, called an interaction term, constructed as the product of X1 and X2. Then we have

    Y = β0 + β1 X1 + β2 X2 + β3 X1 X2 + ε = β0 + (β1 + β3 X2) X1 + β2 X2 + ε.

Since β1 + β3 X2 changes with X2, the effect of X1 on Y is no longer constant: adjusting X2 changes the impact of X1 on Y. The hierarchical principle states that if we include an interaction in a model, we should also include the main effects, even if the p-values associated with their coefficients are not significant. In other words, if the interaction between X1 and X2 seems important, then we should include both X1 and X2 in the model even if their coefficient estimates have large p-values.

Non-linear Relationships

Polynomial regression accommodates non-linear relationships. A simple approach for incorporating non-linear associations in a linear model is to include transformed versions of the predictors. For example, to allow a quadratic shape, we could fit the model Y = β0 + β1 X1 + β2 X1² + ε.
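The interaction and quadratic terms above are simply extra columns in the design matrix. Below is a minimal sketch, assuming hypothetical predictors x1, x2 and response y, showing how the patsy-style formulas in statsmodels add the product term X1 X2 and the squared term X1²; note that x1*x2 expands to x1 + x2 + x1:x2, which respects the hierarchical principle.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical data with an interaction and some curvature built in
    rng = np.random.default_rng(0)
    x1 = rng.uniform(0, 10, 200)
    x2 = rng.uniform(0, 10, 200)
    y = 3 + 2 * x1 + 1.5 * x2 + 0.8 * x1 * x2 + 0.3 * x1 ** 2 + rng.normal(0, 1, 200)
    df = pd.DataFrame({"y": y, "x1": x1, "x2": x2})

    # Interaction model: x1*x2 expands to x1 + x2 + x1:x2, so the main
    # effects are retained along with the interaction term.
    fit_inter = smf.ols("y ~ x1 * x2", data=df).fit()

    # Polynomial (quadratic) term: I(x1**2) includes a transformed predictor.
    fit_quad = smf.ols("y ~ x1 + I(x1**2)", data=df).fit()

    print(fit_inter.params)
    print(fit_quad.params)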
3.3.3 Potential Problems

Potential problems include:
1. Non-linearity of the response-predictor relationships
2. Correlation of error terms
3. Non-constant variance of error terms
4. Outliers
5. High-leverage points
6. Collinearity

1. Non-linearity of the Data

Residual plots are useful for identifying non-linearity. We can plot the residuals ei = yi − ŷi versus the predictor xi. In multiple regression, we instead plot the residuals versus the predicted (or fitted) values ŷi.

2. Correlation of Error Terms

In linear regression we assume that the error terms are uncorrelated, meaning that the sign of ei provides little or no information about the sign of ei+1. The standard errors are computed under the assumption of uncorrelated error terms. If the errors are in fact correlated, the estimated standard errors will underestimate the true standard errors; the confidence and prediction intervals will be narrower than they should be, and the p-values associated with the model will be lower than they should be. Correlations in the error terms often occur with time series data, which consists of observations measured at discrete points in time; observations obtained at adjacent time points tend to have positively correlated errors. To check for this, we can plot the residuals from our model as a function of time. If the errors are uncorrelated, there should be no discernible pattern; if they are positively correlated, we may see tracking in the residuals - that is, adjacent residuals may have similar values.

3. Non-constant Variance of Error Terms

The variances of the error terms may be non-constant; for instance, they may increase with the value of the response. Non-constant variance in the errors, or heteroscedasticity, can be identified by a funnel shape in the residual plot. A possible solution is to transform the response Y using a concave function such as log Y or √Y.

4. Outliers

An outlier is a point for which yi is far from the value predicted by the model. Outliers can distort the residual standard error and the R² statistic. Residual plots can be used to identify outliers, but it can be difficult to decide how large a residual needs to be before we consider the point an outlier. To address this, instead of plotting the raw residuals, we can plot the studentized residuals, computed by dividing each residual ei by its estimated standard error. Observations whose studentized residuals are greater than 3 in absolute value are possible outliers.

5. High-Leverage Points

Observations with high leverage have an unusual value of xi; for example, the predictor value for the observation may be large relative to the other observations. High-leverage points have a large impact on the estimated regression line, so it is important to find them. In simple linear regression, we can simply look for observations whose predictor value is outside the normal range of the observations. In multiple regression it can be more difficult: with two predictors X1 and X2, an observation can fall within the usual range of X1 and of X2 individually yet be unusual in terms of the two together, making it hard to identify as a high-leverage point from individual plots. To quantify this we compute the leverage statistic; for simple linear regression,

    hi = 1/n + (xi − x̄)² / Σ_{i'=1}^{n} (xi' − x̄)²,

so hi increases with the distance of xi from x̄. The leverage statistic hi is always between 1/n and 1, and the average leverage over all the observations is always equal to (p + 1)/n. So, if the leverage statistic for an observation greatly exceeds (p + 1)/n, we may suspect that the point has high leverage.

6. Collinearity

Collinearity refers to the situation in which two or more predictor variables are closely related to one another. The presence of collinearity can pose problems in the regression context, since it can be difficult to separate out the individual effects of collinear variables on the response: because two collinear predictors tend to increase or decrease together, it is difficult to determine how each one separately is associated with the response. Contour plots of the RSS illustrate the difficulty. Each ellipse represents the set of coefficient values that correspond to the same RSS, with ellipses nearest the center taking on the lowest values of RSS, and the black dots marking the least squares estimates. For a data set whose predictors are not collinear, the contours are circles and are evenly spaced, each ring representing the same increase over the previous one. Collinear predictors instead give a long, narrow ellipse, so a broad range of coefficient values yields nearly the same RSS; a small change in the data could move the pair of coefficient values that minimizes the RSS far along the ellipse, and the result is a great deal of uncertainty in the coefficient estimates. Collinearity therefore reduces the accuracy of the estimates of the regression coefficients and causes the standard error of β̂j to grow. Recall that the t-statistic for each predictor is calculated by dividing β̂j by its standard error, so collinearity results in a decline in the t-statistic; in the presence of collinearity we may fail to reject the null hypothesis, and the power of the hypothesis test is reduced. When this happens, we should note the collinearity issue when reporting the model. A simple way to detect collinearity is to look at the correlation matrix of the predictors: the larger an element is in absolute value, the more highly correlated the corresponding variables are. Unfortunately, not all collinearity problems can be detected this way, since collinearity can exist among three or more variables even if no single pair has a particularly high correlation. We call this multicollinearity. Instead of looking at the correlation matrix, we can compute the variance inflation factor (VIF), which is the ratio of the variance of β̂j when fitting the full model to the variance of β̂j if fit on its own. The smallest possible value of the VIF is 1, indicating the complete absence of collinearity; a VIF above 5 or 10 indicates a problematic amount of collinearity. It is computed as

    VIF(β̂j) = 1 / (1 − R²_{Xj|X−j}),

where R²_{Xj|X−j} is the R² from a regression of Xj onto all of the other predictors. If R²_{Xj|X−j} is close to one, collinearity is present and the VIF will be large. When faced with collinearity, we can drop one of the problematic variables from the regression fit; this can usually be done without compromising the fit much, since collinearity implies that the information this variable provides about the response is redundant in the presence of the other variables. Another solution is to combine the collinear variables into a single predictor.
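To make these diagnostics concrete, here is a small numpy sketch, on a hypothetical design matrix X and response y, that computes the leverage statistics hi as the diagonal of the hat matrix, the (internally) studentized residuals used to flag outliers, and the VIF for each predictor from the formula above.

    import numpy as np

    def regression_diagnostics(X, y):
        """Leverage, studentized residuals, and VIFs for an OLS fit.
        X is an n x p matrix of predictors (no intercept column); y has length n."""
        n, p = X.shape
        Xd = np.column_stack([np.ones(n), X])           # add intercept column

        # Hat matrix H = Xd (Xd'Xd)^{-1} Xd'; the leverage h_i is its diagonal,
        # and the average of the h_i equals (p + 1)/n.
        H = Xd @ np.linalg.inv(Xd.T @ Xd) @ Xd.T
        h = np.diag(H)

        beta = np.linalg.solve(Xd.T @ Xd, Xd.T @ y)
        resid = y - Xd @ beta
        sigma2 = resid @ resid / (n - p - 1)            # squared RSE
        # Studentized residuals: e_i / (sigma_hat * sqrt(1 - h_i)); |value| > 3
        # flags a possible outlier.
        stud = resid / np.sqrt(sigma2 * (1 - h))

        # VIF_j = 1 / (1 - R^2_{Xj|X-j}): regress X_j on the other predictors.
        vif = []
        for j in range(p):
            xj = X[:, j]
            others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
            coef, *_ = np.linalg.lstsq(others, xj, rcond=None)
            fitted = others @ coef
            r2 = 1 - np.sum((xj - fitted) ** 2) / np.sum((xj - xj.mean()) ** 2)
            vif.append(1.0 / (1.0 - r2))
        return h, stud, np.array(vif)

    # Example usage on made-up, nearly collinear data
    rng = np.random.default_rng(1)
    x1 = rng.normal(size=100)
    x2 = x1 + rng.normal(scale=0.1, size=100)           # almost a copy of x1
    X = np.column_stack([x1, x2])
    y = 1 + 2 * x1 + rng.normal(size=100)
    h, stud, vif = regression_diagnostics(X, y)
    print(h.mean(), np.abs(stud).max(), vif)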
3.4 The Marketing Plan

1. Is there a relationship between advertising budget and sales? Fit a multiple regression model of sales onto TV, radio, and newspaper, and test the null hypothesis H0 that none of the media is associated with sales. The F-statistic can be used to determine whether or not to reject this null hypothesis (see the code sketch after this list).

2. How strong is the relationship? The RSE estimates the standard deviation of the response from the population regression line. For the Advertising data, the RSE is 1,681 units while the mean value of the response is 14,022 units, indicating a percentage error of roughly 12%. The R² statistic records the percentage of variability in the response that is explained by the predictors; here the predictors explain almost 90% of the variance in sales.

3. Which media contribute to sales? We examine the p-values associated with each predictor's t-statistic.

4. How large is the effect of each medium on sales? The standard error of β̂j can be used to construct confidence intervals for βj. Confidence intervals that are narrow and far from zero provide evidence that the corresponding media are related to the response; if an interval includes zero, the variable is not statistically significant given the values of the other predictors.

5. How accurately can we predict future sales? The accuracy of the estimate depends on whether we wish to predict an individual response, Y = f(X) + ε, or the average response, f(X). For the former we use a prediction interval; for the latter, a confidence interval. Prediction intervals are always wider than confidence intervals because they account for the uncertainty associated with ε, the irreducible error.

6. Is the relationship linear? Residual plots can be used to identify non-linearity. If the relationships are linear, the residual plots should display no pattern. Remember that transformations of the predictors can be included in the linear regression model to accommodate non-linear relationships.

7. Is there synergy among the advertising media? The standard linear regression model assumes an additive, linear relationship between the predictors and the response. An additive model is easy to interpret because the effect of each predictor on the response is unrelated to the values of the other predictors, but the additive assumption may be unrealistic. Including an interaction term in the regression model accommodates non-additive relationships.
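The quantities referenced in the plan (the F-statistic, RSE, R², coefficient confidence intervals, and confidence versus prediction intervals) can all be read off a single statsmodels fit. This is a minimal sketch that assumes an Advertising-style CSV with columns TV, radio, newspaper, and sales; the file name and the budget values in the new data frame are placeholders.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Placeholder path; assumes columns TV, radio, newspaper, sales
    ads = pd.read_csv("Advertising.csv")

    fit = smf.ols("sales ~ TV + radio + newspaper", data=ads).fit()

    # Q1: overall F-test of H0 (no medium is associated with sales)
    print(fit.fvalue, fit.f_pvalue)

    # Q2: strength of the relationship (RSE, percentage error, R^2)
    rse = np.sqrt(fit.mse_resid)
    print(rse, rse / ads["sales"].mean(), fit.rsquared)

    # Q3/Q4: individual p-values and 95% confidence intervals for each coefficient
    print(fit.pvalues)
    print(fit.conf_int(alpha=0.05))

    # Q5: confidence interval for the average response (mean_ci_* columns) vs.
    # prediction interval for an individual response (obs_ci_* columns)
    new = pd.DataFrame({"TV": [100.0], "radio": [20.0], "newspaper": [10.0]})
    print(fit.get_prediction(new).summary_frame(alpha=0.05))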
3.5 Comparison of Linear Regression with K-Nearest Neighbors

The KNN regression method is closely related to the KNN classifier. Given a value for K and a prediction point x0, KNN regression first identifies the K training observations that are closest to x0, represented by N0, and then estimates f(x0) as the average of their responses:

    f̂(x0) = (1/K) Σ_{xi ∈ N0} yi.

The optimal value of K will depend on the bias-variance tradeoff. A small K gives a flexible fit with low bias but high variance; when K = 1, this variance arises because the prediction in a given region depends entirely on a single observation. Larger values of K provide a smoother and less variable fit: the prediction in a region is an average of several points, so changing one observation has a smaller effect. However, the smoothing may introduce bias by masking some of the structure in f(X). Note that the parametric approach will outperform the non-parametric approach if the parametric form that has been selected is close to the true form of f.
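The averaging formula above is easy to implement directly. Here is a minimal numpy sketch of KNN regression on made-up one-dimensional data (the training arrays and the values of K are arbitrary): it simply finds the K training points nearest to x0 and averages their responses.

    import numpy as np

    def knn_predict(x_train, y_train, x0, k):
        """KNN regression: average the responses of the k training points
        closest to the query point x0."""
        dist = np.abs(x_train - x0)                # distances to x0 (1-D case)
        nearest = np.argsort(dist)[:k]             # indices of the k closest points
        return y_train[nearest].mean()             # f_hat(x0) = (1/k) * sum of y_i

    # Made-up training data from a smooth function plus noise
    rng = np.random.default_rng(2)
    x_train = np.sort(rng.uniform(-3, 3, 60))
    y_train = np.sin(x_train) + rng.normal(scale=0.2, size=60)

    # Small K: flexible, high-variance fit; larger K: smoother, more biased fit
    for k in (1, 3, 15):
        print(k, knn_predict(x_train, y_train, x0=0.5, k=k))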