HOW TO RUN LOGISTIC REGRESSION WITH IBM SPSS-20 VERSION
Source: http://publib.boulder.ibm.com/infocenter/spssstat/v20r0m0/index.jsp?topic=%2Fcom.ibm.spss.statistics.help%2Fidh_lreg_val.htm

Logistic regression is useful for situations in which you want to be able to predict the presence or absence of a characteristic or outcome based on values of a set of predictor variables. It is similar to a linear regression model but is suited to models where the dependent variable is dichotomous. Logistic regression coefficients can be used to estimate odds ratios for each of the independent variables in the model. Logistic regression is applicable to a broader range of research situations than discriminant analysis.

Statistics. For each analysis: total cases, selected cases, valid cases. For each categorical variable: parameter coding. For each step: variable(s) entered or removed, iteration history, -2 log-likelihood, goodness of fit, Hosmer-Lemeshow goodness-of-fit statistic, model chi-square, improvement chi-square, classification table, correlations between variables, observed groups and predicted probabilities chart, residual chi-square. For each variable in the equation: coefficient (B), standard error of B, Wald statistic, estimated odds ratio (exp(B)), confidence interval for exp(B), log-likelihood if term removed from model. For each variable not in the equation: score statistic. For each case: observed group, predicted probability, predicted group, residual, standardized residual.

Methods. You can estimate models using block entry of variables or any of the following stepwise methods: forward conditional, forward LR, forward Wald, backward conditional, backward LR, or backward Wald.

Binary logistic regression is most useful when you want to model the event probability for a categorical response variable with two outcomes. For example:
• A catalog company wants to increase the proportion of mailings that result in sales.
• A doctor wants to accurately diagnose a possibly cancerous tumor.
• A loan officer wants to know whether the next customer is likely to default.

Other example. What lifestyle characteristics are risk factors for coronary heart disease (CHD)? Given a sample of patients measured on smoking status, diet, exercise, alcohol use, and CHD status, you could build a model using the four lifestyle variables to predict the presence or absence of CHD in a sample of patients. The model can then be used to derive estimates of the odds ratios for each factor to tell you, for example, how much more likely smokers are to develop CHD than nonsmokers.

Using the Binary Logistic Regression procedure, the catalog company can send mailings to the people who are most likely to respond, the doctor can determine whether the tumor is more likely to be benign or malignant, and the loan officer can assess the risk of extending credit to a particular customer.

The Logistic Regression Model

In the logistic regression model, the relationship between Z and the probability of the event of interest is described by this link function:

π_i = e^{z_i} / (1 + e^{z_i}) = 1 / (1 + e^{-z_i}),   or   z_i = log( π_i / (1 - π_i) )

where π_i is the probability that the ith case experiences the event of interest and z_i is the value of the unobserved continuous variable for the ith case.

The model also assumes that Z is linearly related to the predictors:

z_i = b_0 + b_1 x_{i1} + b_2 x_{i2} + … + b_p x_{ip}

where x_{ij} is the jth predictor for the ith case, b_j is the jth coefficient, and p is the number of predictors.

If Z were observable, you would simply fit a linear regression to Z and be done. However, since Z is unobserved, you must relate the predictors to the probability of interest by substituting for Z:

π_i = 1 / (1 + e^{-(b_0 + b_1 x_{i1} + … + b_p x_{ip})})

The regression coefficients are estimated through an iterative maximum likelihood method.
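The link function is easy to check numerically. The following minimal Python sketch (not part of the SPSS procedure; the function name and the coefficient and predictor values are made up for illustration) computes the predicted probability for a single case from the linear predictor.

import numpy as np

def predicted_probability(b, x):
    """Predicted event probability for one case.
    b = [b0, b1, ..., bp] are illustrative coefficients; x = [x1, ..., xp] are predictor values."""
    z = b[0] + np.dot(b[1:], x)        # linear predictor z = b0 + b1*x1 + ... + bp*xp
    return 1.0 / (1.0 + np.exp(-z))    # logistic link: pi = 1 / (1 + e^(-z))

# With these made-up values the linear predictor is 0, so the probability is 0.5.
print(predicted_probability([-1.0, 0.5, -0.25], [4.0, 4.0]))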
Set Rule

Cases defined by the selection rule are included in model estimation. For example, if you selected a variable and equals and specified a value of 5, then only the cases for which the selected variable has a value equal to 5 are included in estimating the model. Statistics and classification results are generated for both selected and unselected cases. This provides a mechanism for classifying new cases based on previously existing data, or for partitioning your data into training and testing subsets, to perform validation on the model generated.

This feature requires the Regression option. From the menus choose:
Analyze > Regression > Binary Logistic…
In the Logistic Regression dialog box, click Select. Choose a selection variable and then click Rule. Define a selection rule for selecting a subset of cases for analysis. The name of the selected variable appears on the left. Select a relation. The available relations are equals, not equal to, less than, less than or equal to, greater than, and greater than or equal to. Enter a value.

Variable Selection Methods

Method selection allows you to specify how independent variables are entered into the analysis. Using different methods, you can construct a variety of regression models from the same set of variables (a simplified sketch of forward selection follows this section).
• Enter. A procedure for variable selection in which all variables in a block are entered in a single step.
• Forward Selection (Conditional). Stepwise selection method with entry testing based on the significance of the score statistic, and removal testing based on the probability of a likelihood-ratio statistic based on conditional parameter estimates.
• Forward Selection (Likelihood Ratio). Stepwise selection method with entry testing based on the significance of the score statistic, and removal testing based on the probability of a likelihood-ratio statistic based on the maximum partial likelihood estimates.
• Forward Selection (Wald). Stepwise selection method with entry testing based on the significance of the score statistic, and removal testing based on the probability of the Wald statistic.
• Backward Elimination (Conditional). Backward stepwise selection. Removal testing is based on the probability of the likelihood-ratio statistic based on conditional parameter estimates.
• Backward Elimination (Likelihood Ratio). Backward stepwise selection. Removal testing is based on the probability of the likelihood-ratio statistic based on the maximum partial likelihood estimates.
• Backward Elimination (Wald). Backward stepwise selection. Removal testing is based on the probability of the Wald statistic.

The significance values in your output are based on fitting a single model. Therefore, the significance values are generally invalid when a stepwise method is used.

All independent variables selected are added to a single regression model. However, you can specify different entry methods for different subsets of variables. For example, you can enter one block of variables into the regression model using stepwise selection and a second block using forward selection. To add a second block of variables to the regression model, click Next.
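As a rough illustration of the forward stepwise idea, the Python sketch below adds one predictor at a time using a likelihood-ratio (change in -2 log-likelihood) test. This is a simplification of what SPSS does: the Forward LR method enters variables based on the score statistic and uses the likelihood-ratio test for removal, whereas this sketch uses the likelihood-ratio test for entry and assumes numeric predictors (one degree of freedom each). It relies on the third-party statsmodels and SciPy packages, outside SPSS.

import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy.stats import chi2

def forward_lr(data, outcome, candidates, p_enter=0.05):
    """Add, one at a time, the candidate whose likelihood-ratio test against the
    current model is most significant; stop when no candidate reaches p_enter."""
    y = data[outcome]
    selected = []
    current = sm.Logit(y, np.ones((len(y), 1))).fit(disp=0)   # intercept-only model
    while True:
        best_p, best_var, best_fit = 1.0, None, None
        for var in candidates:
            if var in selected:
                continue
            trial = sm.Logit(y, sm.add_constant(data[selected + [var]])).fit(disp=0)
            lr = 2 * (trial.llf - current.llf)                # change in -2 log-likelihood
            p = chi2.sf(lr, df=1)                             # 1 df per numeric predictor
            if p < best_p:
                best_p, best_var, best_fit = p, var, trial
        if best_var is None or best_p >= p_enter:
            break
        selected.append(best_var)
        current = best_fit
    return selected, current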
Define Categorical Variables

You can specify details of how the Logistic Regression procedure will handle categorical variables.

Covariates. Contains a list of all of the covariates specified in the main dialog box, either by themselves or as part of an interaction, in any layer. If some of these are string variables or are categorical, you can use them only as categorical covariates.

Categorical Covariates. Lists variables identified as categorical. Each variable includes a notation in parentheses indicating the contrast coding to be used. String variables (denoted by the symbol < following their names) are already present in the Categorical Covariates list. Select any other categorical covariates from the Covariates list and move them into the Categorical Covariates list.

Change Contrast. Allows you to change the contrast method. Available contrast methods are:
• Indicator. Contrasts indicate the presence or absence of category membership. The reference category is represented in the contrast matrix as a row of zeros.
• Simple. Each category of the predictor variable (except the reference category) is compared to the reference category.
• Difference. Each category of the predictor variable except the first category is compared to the average effect of previous categories. Also known as reverse Helmert contrasts.
• Helmert. Each category of the predictor variable except the last category is compared to the average effect of subsequent categories.
• Repeated. Each category of the predictor variable except the first category is compared to the category that precedes it.
• Polynomial. Orthogonal polynomial contrasts. Categories are assumed to be equally spaced. Polynomial contrasts are available for numeric variables only.
• Deviation. Each category of the predictor variable except the reference category is compared to the overall effect.

If you select Deviation, Simple, or Indicator, select either First or Last as the reference category. Note that the method is not actually changed until you click Change.

String covariates must be categorical covariates. To remove a string variable from the Categorical Covariates list, you must remove all terms containing the variable from the Covariates list in the main dialog box.

This feature requires the Regression option. From the menus choose:
Analyze > Regression > Binary Logistic…
In the Logistic Regression dialog box, select at least one variable in the Covariates list and then click Categorical. In the Categorical Covariates list, select the covariate(s) whose contrast method you wish to change. You can change multiple covariates simultaneously. Select a contrast method from the Method drop-down list. Click Change.
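For readers unfamiliar with indicator (dummy) coding, here is a minimal illustration in Python with pandas, outside SPSS. The level names are made up, and this sketch drops the first level as the reference, whereas the SPSS dialog lets you choose First or Last.

import pandas as pd

# Made-up categorical predictor with three levels
df = pd.DataFrame({"ed": ["high school", "college", "high school", "postgraduate"]})

# Indicator coding: one 0/1 column per category except the reference category,
# which is represented by zeros in every indicator column.
indicators = pd.get_dummies(df["ed"], prefix="ed", drop_first=True)
print(indicators)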
Save New Variables

You can save results of the logistic regression as new variables in the active dataset.

Predicted Values. Saves values predicted by the model. Available options are Probabilities and Group membership.
• Probabilities. For each case, saves the predicted probability of occurrence of the event. A table in the output displays the name and contents of any new variables. The "event" is the category of the dependent variable with the higher value; for example, if the dependent variable takes values 0 and 1, the predicted probability of category 1 is saved.
• Predicted Group Membership. The group with the largest posterior probability, based on discriminant scores. The group the model predicts the case belongs to.

Influence. Saves values from statistics that measure the influence of cases on predicted values. Available options are Cook's, Leverage values, and DfBeta(s).
• Cook's. The logistic regression analog of Cook's influence statistic. A measure of how much the residuals of all cases would change if a particular case were excluded from the calculation of the regression coefficients.
• Leverage Value. The relative influence of each observation on the model's fit.
• DfBeta(s). The difference in beta value is the change in the regression coefficient that results from the exclusion of a particular case. A value is computed for each term in the model, including the constant.

Residuals. Saves residuals. Available options are Unstandardized, Logit, Studentized, Standardized, and Deviance.
• Unstandardized Residuals. The difference between an observed value and the value predicted by the model.
• Logit Residual. The residual for the case if it is predicted on the logit scale. The logit residual is the residual divided by the predicted probability times 1 minus the predicted probability.
• Studentized Residual. The change in the model deviance if a case is excluded.
• Standardized Residuals. The residual divided by an estimate of its standard deviation. Standardized residuals, which are also known as Pearson residuals, have a mean of 0 and a standard deviation of 1.
• Deviance. Residuals based on the model deviance.

Export model information to XML file. Parameter estimates and (optionally) their covariances are exported to the specified file in XML (PMML) format. You can use this model file to apply the model information to other data files for scoring purposes. See the topic Scoring Wizard for more information.

This feature requires the Regression option. From the menus choose:
Analyze > Regression > Binary Logistic…
In the Logistic Regression dialog box, click Save.

Options

You can specify options for your logistic regression analysis.

Statistics and Plots. Allows you to request statistics and plots. Available options are Classification plots, Hosmer-Lemeshow goodness-of-fit, Casewise listing of residuals, Correlations of estimates, Iteration history, and CI for exp(B). Select one of the alternatives in the Display group to display statistics and plots either At each step or, only for the final model, At last step.
• Hosmer-Lemeshow goodness-of-fit statistic. This goodness-of-fit statistic is more robust than the traditional goodness-of-fit statistic used in logistic regression, particularly for models with continuous covariates and studies with small sample sizes. It is based on grouping cases into deciles of risk and comparing the observed probability with the expected probability within each decile.

Probability for Stepwise. Allows you to control the criteria by which variables are entered into and removed from the equation. You can specify criteria for Entry or Removal of variables. A variable is entered into the model if the probability of its score statistic is less than the Entry value and is removed if the probability is greater than the Removal value. To override the default settings, enter positive values for Entry and Removal. Entry must be less than Removal.

Classification cutoff. Allows you to determine the cut point for classifying cases. Cases with predicted values that exceed the classification cutoff are classified as positive, while those with predicted values smaller than the cutoff are classified as negative. To change the default, enter a value between 0.01 and 0.99.

Maximum Iterations. Allows you to change the maximum number of times that the model iterates before terminating.

Include constant in model. Allows you to indicate whether the model should include a constant term. If disabled, the constant term will equal 0.

This feature requires the Regression option. From the menus choose:
Analyze > Regression > Binary Logistic…
In the Logistic Regression dialog box, click Options.
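The decile-of-risk idea behind the Hosmer-Lemeshow statistic can be sketched outside SPSS. The Python function below (the function name, the use of pandas and SciPy, and the equal-sized grouping are assumptions for illustration, not SPSS output) bins cases on their predicted probabilities and compares observed with expected event counts in each bin.

import pandas as pd
from scipy.stats import chi2

def hosmer_lemeshow(y, p, groups=10):
    """Rough sketch of the Hosmer-Lemeshow test: y is the 0/1 outcome,
    p the model-predicted probability; cases are split into deciles of risk."""
    df = pd.DataFrame({"y": y, "p": p})
    df["bin"] = pd.qcut(df["p"], groups, labels=False, duplicates="drop")
    stat = 0.0
    for _, g in df.groupby("bin"):
        n = len(g)
        observed = g["y"].sum()       # observed events in the bin
        expected = g["p"].sum()       # expected events = sum of predicted probabilities
        stat += (observed - expected) ** 2 / (expected * (1 - expected / n))
    k = df["bin"].nunique()
    return stat, chi2.sf(stat, df=k - 2)   # a small significance value suggests poor fit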
Additional Features

The command syntax language also allows you to:
• Identify casewise output by the values or variable labels of a variable.
• Control the spacing of iteration reports. Rather than printing parameter estimates after every iteration, you can request parameter estimates after every nth iteration.
• Change the criteria for terminating iteration and checking for redundancy.
• Specify a variable list for casewise listings.
• Conserve memory by holding the data for each split file group in an external scratch file during processing.

Example: Using Binary Logistic Regression to Assess Credit Risk

If you are a loan officer at a bank, then you want to be able to identify characteristics that are indicative of people who are likely to default on loans, and use those characteristics to identify good and bad credit risks.

Suppose information on 850 past and prospective customers is contained in bankloan.sav, in the SPSS 20 samples folder: C:\Program Files\IBM\SPSS\Statistics\20\Samples\English. The first 700 cases are customers who were previously given loans. Use a random sample of these 700 customers to create a logistic regression model, setting the remaining customers aside to validate the analysis. Then use the model to classify the 150 prospective customers as good or bad credit risks.

Preparing the Data for Analysis

Setting the random seed allows you to replicate the random selection of cases in this analysis.
► To set the random seed, from the menus choose:
Transform > Random Number Generators...
► Select Set Starting Point.
► Select Fixed Value and type 9191972 as the value.
► Click OK.

► To create the selection variable for validation, from the menus choose:
Transform > Compute Variable...
► Type validate in the Target Variable text box.
► Type rv.bernoulli(0.7) in the Numeric Expression text box.
This sets the values of validate to be randomly generated Bernoulli variates with probability parameter 0.7.

You only intend to use validate with cases that could be used to create the model; that is, previous customers. However, there are 150 cases corresponding to potential customers in the data file.
► To perform the computation only for previous customers, click If.
► Select Include if case satisfies condition.
► Type MISSING(default) = 0 as the conditional expression.
This ensures that validate is only computed for cases with non-missing values for default; that is, for customers who previously received loans.
► Click Continue.
► Click OK in the Compute Variable dialog box.

Approximately 70 percent of the customers previously given loans will have a validate value of 1. These customers will be used to create the model. The remaining customers who were previously given loans will be used to validate the model results.
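For readers who want to reproduce the same kind of split outside SPSS, here is a rough Python equivalent of the validate computation. The file name bankloan.csv and the use of NumPy's generator are assumptions for illustration; the draws will not match SPSS's random number stream, so the selected cases will differ from the ones shown in this example.

import numpy as np
import pandas as pd

rng = np.random.default_rng(9191972)        # fixed seed, so the split is reproducible
df = pd.read_csv("bankloan.csv")            # hypothetical CSV export of bankloan.sav

# Mirror COMPUTE validate = RV.BERNOULLI(0.7) with the condition MISSING(default) = 0:
# previous customers (default is not missing) get validate = 1 with probability 0.7.
previous = df["default"].notna()
df.loc[previous, "validate"] = rng.binomial(1, 0.7, size=int(previous.sum()))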
Running the Analysis

► To create the logistic regression model, from the menus choose:
Analyze > Regression > Binary Logistic...
► Select Previously defaulted as the dependent variable.
► Select Age in years through Other debt in thousands as covariates.
► Select Forward: LR as the method.
► Select validate as the selection variable and click Rule.
► Type 1 as the value for the selection variable.
► Click Continue.
► Click Categorical in the Logistic Regression dialog box.
► Select Level of education as a categorical covariate.
► Click Continue.
► Click Save in the Logistic Regression dialog box.
► Select Probabilities in the Predicted Values group, Cook's in the Influence group, and Studentized in the Residuals group.
► Click Continue.
► Click Options in the Logistic Regression dialog box.
► Select Classification plots and Hosmer-Lemeshow goodness-of-fit.
► Click Continue.
► Click OK in the Logistic Regression dialog box.

These selections generate the following command syntax:

LOGISTIC REGRESSION VAR=default
  /SELECT validate EQ 1
  /METHOD=FSTEP(LR) age ed employ address income debtinc creddebt othdebt
  /CONTRAST (ed)=Indicator
  /SAVE PRED COOK SRESID
  /CLASSPLOT
  /PRINT=GOODFIT
  /CRITERIA PIN(.05) POUT(.10) ITERATE(20) CUT(.5) .

• The procedure fits a logistic regression model for default.
• The SELECT subcommand specifies that only cases with a value of 1 on the variable validate will be used to build the model. Other cases will be scored and can be used to validate the final model.
• The variables specified on the METHOD subcommand will be entered into the model using forward stepwise regression based on the significance of the likelihood-ratio statistic.
• The CONTRAST subcommand specifies that ed is a categorical variable and should be processed using indicator coding.
• The SAVE subcommand requests that the casewise predicted values, Cook's influence statistics, and studentized residuals be saved as new variables to the dataset.
• The CLASSPLOT subcommand requests a classification plot of the observed and predicted values.
• The PRINT subcommand requests the Hosmer-Lemeshow goodness-of-fit statistic.
• All other options are set to their default values.

Model Diagnostics

After building a model, you need to determine whether it reasonably approximates the behavior of your data.

Tests of Model Fit. The Binary Logistic Regression procedure reports the Hosmer-Lemeshow goodness-of-fit statistic.

Residual Plots. Using variables specified in the Save dialog box, you can construct various diagnostic plots. Two helpful plots are the change in deviance versus predicted probabilities and Cook's distances versus predicted probabilities.

Tests of Model Fit

Goodness-of-fit statistics help you to determine whether the model adequately describes the data. The Hosmer-Lemeshow statistic indicates a poor fit if the significance value is less than 0.05. Here, the model adequately fits the data. This statistic is the most reliable test of model fit for IBM® SPSS® Statistics binary logistic regression, because it aggregates the observations into groups of "similar" cases. The statistic is then computed based upon these groups.

Change in Deviance Versus Predicted Probabilities

The change in deviance is not an option in the Save dialog box, but it can be estimated by squaring the studentized residuals.
► To create the change in deviance, from the menus choose:
Transform > Compute Variable...
► Type chgdev as the Target Variable.
► Type sre_1**2 as the Numeric Expression.
► Click OK.
The squared studentized residuals have been saved to chgdev.

► To produce the residual plot, from the menus choose:
Graphs > Chart Builder...
► Select the Scatter/Dot gallery and choose Simple Scatter.
► Select chgdev as the variable to plot on the Y axis.
► Select Predicted probability as the variable to plot on the X axis.
► Click OK.
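The same plot can also be reproduced outside SPSS from the saved variables, for example after exporting them to a CSV file. In the sketch below, the file name bankloan_scored.csv is hypothetical; PRE_1 and SRE_1 are the default names SPSS gives the saved predicted probability and studentized residual, but check your own output for the exact names.

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("bankloan_scored.csv")      # hypothetical export of the scored dataset
df["chgdev"] = df["SRE_1"] ** 2              # approximate change in deviance

plt.scatter(df["PRE_1"], df["chgdev"], s=10)
plt.xlabel("Predicted probability")
plt.ylabel("Approximate change in deviance")
plt.show()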
The change in deviance plot helps you to identify cases that are poorly fit by the model. Larger changes in deviance indicate poorer fits. There are two distinct patterns in the plot: a curve that extends from the lower left to the upper right, and a curve that extends from the upper left to the lower right.
• The curve that extends from the lower left to the upper right corresponds to cases in which the dependent variable has a value of 0. Thus, non-defaulters who have large model-predicted probabilities of default are poorly fit by the model.
• The curve that extends from the upper left to the lower right corresponds to cases in which the dependent variable has a value of 1. Thus, defaulters who have small model-predicted probabilities of default are poorly fit by the model.
By identifying the cases that are poorly fit by the model, you can focus on how those customers are different, and hopefully discover another predictor that will improve the model.

Cook's Distances Versus Predicted Probabilities

► Recall the Chart Builder.
► Select Analog of Cook's influence statistics as the variable to plot on the Y axis.
► Select Predicted probability as the variable to plot on the X axis.
► Click OK.

The shape of the Cook's distances plot generally follows that of the previous figure, with some minor exceptions. These exceptions are high-leverage points, and can be influential to the analysis. You should note which cases correspond to these points for further investigation.

Choosing the Right Model

There are usually several models that pass the diagnostic checks, so you need tools to choose between them.
• Automated Variable Selection. When constructing a model, you generally want to only include predictors that contribute significantly to the model. The Binary Logistic Regression procedure offers several methods for stepwise selection of the "best" predictors to include in the model.
• Pseudo R-Squared Statistics. The r-squared statistic, which measures the variability in the dependent variable that is explained by a linear regression model, cannot be computed for logistic regression models. The pseudo r-squared statistics are designed to have similar properties to the true r-squared statistic.
• Classification and Validation. Crosstabulating observed response categories with predicted categories helps you to determine how well the model identifies defaulters.

Variable Selection

Forward stepwise methods start with a model that doesn't include any of the predictors. At each step, the predictor with the largest score statistic whose significance value is less than a specified value (by default 0.05) is added to the model. The variables left out of the analysis at the last step all have significance values larger than 0.05, so no more are added.

The variables chosen by the forward stepwise method should all have significant changes in -2 log-likelihood. The change in -2 log-likelihood is generally more reliable than the Wald statistic. If the two disagree as to whether a predictor is useful to the model, trust the change in -2 log-likelihood. See the regression coefficients table for the Wald statistic.

As a further check, you can build a model using backward stepwise methods. Backward methods start with a model that includes all of the predictors. At each step, the predictor that contributes the least is removed from the model, until all of the predictors in the model are significant. If the two methods choose the same variables, you can be fairly confident that it's a good model.
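For reference, the two statistics being compared here have standard forms (these are textbook definitions, not reproduced from the original document):

G^2 = -2( log L_reduced - log L_full ), compared with a chi-square distribution whose degrees of freedom equal the number of parameters added or removed;

Wald = ( B / SE_B )^2, compared with a chi-square distribution with 1 degree of freedom for a single coefficient.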
R-Squared Statistics

In the linear regression model, the coefficient of determination, R², summarizes the proportion of variance in the dependent variable associated with the predictor (independent) variables, with larger R² values indicating that more of the variation is explained by the model, to a maximum of 1. For regression models with a categorical dependent variable, it is not possible to compute a single R² statistic that has all of the characteristics of R² in the linear regression model, so these approximations are computed instead. The following methods are used to estimate the coefficient of determination.
• Cox and Snell's R² (Cox and Snell, 1989) is based on the log-likelihood for the model compared to the log-likelihood for a baseline model. However, with categorical outcomes, it has a theoretical maximum value of less than 1, even for a "perfect" model.
• Nagelkerke's R² (Nagelkerke, 1991) is an adjusted version of the Cox & Snell R-square that adjusts the scale of the statistic to cover the full range from 0 to 1.
• McFadden's R² (McFadden, 1974) is another version, based on the log-likelihood kernels for the intercept-only model and the full estimated model.
What constitutes a "good" R² value varies between different areas of application. While these statistics can be suggestive on their own, they are most useful when comparing competing models for the same data. The model with the largest R² statistic is "best" according to this measure.

Classification

The classification table shows the practical results of using the logistic regression model. For each case, the predicted response is Yes if that case's model-predicted probability is greater than the cutoff value specified in the dialogs (in this case, the default of 0.5).
• Cells on the diagonal are correct predictions.
• Cells off the diagonal are incorrect predictions.
Of the cases used to create the model, 57 of the 124 people who previously defaulted are classified correctly, and 352 of the 375 non-defaulters are classified correctly. Overall, 82.0% of the cases are classified correctly.

From step to step, the improvement in classification indicates how well your model performs. A better model should correctly identify a higher percentage of the cases. Classifications based upon the cases used to create the model tend to be too "optimistic" in the sense that their classification rate is inflated.

Subset validation is obtained by classifying past customers who were not used to create the model. These results are shown in the Unselected Cases section of the table. 80.6 percent of these cases were correctly classified by the model. This suggests that, overall, your model is in fact correct about four out of five times. Even the subset validation is limited, however, because it is based on only one cutoff value.
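To see how such a table is constructed, here is a minimal Python sketch, outside SPSS, that crosstabulates observed against predicted outcomes at a 0.5 cutoff. The file name bankloan_scored.csv and the column names are illustrative; in SPSS the saved predicted probability is typically named PRE_1, and SPSS itself reports selected and unselected cases separately, which this sketch does not.

import pandas as pd

# All previous customers (cases with a non-missing observed outcome); names are illustrative.
df = pd.read_csv("bankloan_scored.csv").dropna(subset=["default"])
cutoff = 0.5
df["predicted"] = (df["PRE_1"] > cutoff).astype(int)     # Yes if probability exceeds the cutoff

table = pd.crosstab(df["default"], df["predicted"],
                    rownames=["Observed"], colnames=["Predicted"])
print(table)

overall = (df["default"] == df["predicted"]).mean()      # overall proportion classified correctly
print(f"Overall percentage correct: {overall:.1%}")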
Logistic Regression Coefficients

The parameter estimates table summarizes the effect of each predictor. The ratio of the coefficient to its standard error, squared, equals the Wald statistic. If the significance level of the Wald statistic is small (less than 0.05), then the parameter is useful to the model. The predictors and coefficient values shown in the last step are used by the procedure to make predictions.

The meaning of a logistic regression coefficient is not as straightforward as that of a linear regression coefficient. While B is convenient for testing the usefulness of predictors, Exp(B) is easier to interpret. Exp(B) represents the ratio-change in the odds of the event of interest for a one-unit change in the predictor. For example, Exp(B) for employ is equal to 0.781, which means that the odds of default for a person who has been employed at their current job for two years are 0.781 times the odds of default for a person who has been employed at their current job for one year, all other things being equal. What this difference means in terms of probability depends upon the original probability of default for the person employed for one year.

The odds of default are related to the probability of default by this equation:

odds(default) = P(default) / P(no default) = P(default) / (1 - P(default))

In the case of a customer whose probability of default is 0.5, her corresponding odds of default are 1. The odds of default for a person with one more year on the job are 1 × 0.781 = 0.781, so the corresponding probability of default reduces to 0.439. In the case of a customer whose probability of default is 0.9, his odds of default are 9. The odds of default for a person with one more year on the job are 9 × 0.781 = 7.029, so the corresponding probability of default reduces to 0.875.

The change in probability of default is greater for the person who originally had a 0.5 probability of default. In general, the change will be greater when the original probability is near 0.5, and smaller when it is close to 0 or 1.

Summary

Using the Logistic Regression procedure, you have constructed a model for predicting the probability that a given customer will default on their loan.

A critical issue for loan officers is the cost of Type I and Type II errors. That is, what is the cost of classifying a defaulter as a non-defaulter (Type I)? What is the cost of classifying a non-defaulter as a defaulter (Type II)? If bad debt is the primary concern, then you want to lower your Type I error and maximize your sensitivity. If growing your customer base is the priority, then you want to lower your Type II error and maximize your specificity. Usually both are major concerns, so you have to choose a decision rule for classifying customers that gives the best mix of sensitivity and specificity.

Related Procedures

The Binary Logistic Regression procedure is a useful tool for predicting the value of a categorical response variable with two possible outcomes.
• If there are more than two possible outcomes and they do not have an inherent ordering, use the Multinomial Logistic Regression procedure. Multinomial logistic regression can also be used as an alternative when there are only two outcomes.
• If there are more than two possible outcomes and they are ordered, use the Ordinal Regression procedure.
• If there are more than two possible outcomes and your predictors can be considered scale, the Discriminant Analysis procedure can be useful. Discriminant analysis can also be used when there are only two outcomes.
• If the dependent variable and predictors are scale, use the Linear Regression procedure.
• If the dependent variable is scale and some or all predictors are categorical, use the GLM Univariate procedure.
• In order to use a tool that is more flexible than the classification table, save the model-predicted probabilities and use the ROC Curve procedure. The ROC Curve provides an index of accuracy by demonstrating your model's ability to discriminate between two groups. You can think of it as a visualization of classification tables for all possible cutoffs.
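For readers working outside SPSS, an ROC curve can also be drawn from the saved predicted probabilities with scikit-learn and matplotlib. The sketch below assumes the scored cases have been exported to a CSV file with the illustrative file and column names used earlier; it is a companion to, not a replacement for, the ROC Curve procedure.

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

df = pd.read_csv("bankloan_scored.csv").dropna(subset=["default"])   # hypothetical export
fpr, tpr, thresholds = roc_curve(df["default"], df["PRE_1"])
print("Area under the ROC curve:", roc_auc_score(df["default"], df["PRE_1"]))

plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], linestyle="--")     # reference line for a model with no discrimination
plt.xlabel("1 - Specificity (false positive rate)")
plt.ylabel("Sensitivity (true positive rate)")
plt.show()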
Recommended Readings

See the following texts for more information on logistic regression:

Hosmer, D. W., and S. Lemeshow. 2000. Applied Logistic Regression, 2nd ed. New York: John Wiley and Sons.
Kleinbaum, D. G. 1994. Logistic Regression: A Self-Learning Text. New York: Springer-Verlag.
Jennings, D. E. 1986. Outliers and residual distributions in logistic regression. Journal of the American Statistical Association, 81: 987-990.
Norusis, M. 2004. SPSS 13.0 Statistical Procedures Companion. Upper Saddle River, N.J.: Prentice Hall, Inc.