ECON 2202: Statistical Methods in Economics and Business II

Lecture 23: Introduction to Linear Regression and Correlation Analysis (IV); Multiple Regression Analysis and Model Building (I)

1 Introduction

So far, we have focused on two populations: one dependent variable and one independent variable, and tried to answer the following questions.

Question 1 Is there any linear relationship between them?

We can use the sample correlation coefficient

$$ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}} $$

to measure it. Due to sampling error, we have to test the following hypothesis:

$$ H_0: \rho = 0, \qquad H_A: \rho \neq 0, \qquad t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}}, \qquad df = n - 2 $$

Question 2 After making sure that there is some linear relationship between $x$ and $y$, how do we derive it?

We use the following steps to derive it:

Step 1: Based on some knowledge, specify which variable is the dependent variable and which is the independent variable. Use $y$ to represent the dependent variable and $x$ to represent the independent variable.

Step 2: Assume the linear relationship is as follows:

$$ y = \beta_0 + \beta_1 x + \varepsilon $$

Step 3: Collect data $\{(x_i, y_i) : i = 1, 2, \ldots, n\}$.

Step 4: Use the LSE to derive the estimated linear relationship

$$ \hat{y} = b_0 + b_1 x $$

That is, choose $b_0$ and $b_1$ to minimize the total squared "mistake":

$$ \min_{b_0, b_1} \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2 $$

After solving the two first order conditions, the solutions are

$$ b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad b_0 = \bar{y} - b_1 \bar{x} $$

Step 5: Test the following hypothesis:

$$ H_0: \beta_1 = 0, \qquad H_A: \beta_1 \neq 0, \qquad t = \frac{b_1}{s_{b_1}}, \quad s_{b_1} = \frac{s_\varepsilon}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}}, \qquad df = n - 2 $$

or

$$ H_0: \beta_1^2 = 0, \qquad H_A: \beta_1^2 > 0, \qquad F = \frac{MSR}{MSE}, \qquad D_1 = 2 - 1 = 1 \text{ and } D_2 = n - 2 $$

Once $H_0$ is rejected, we conclude that there is a linear relationship between $x$ and $y$. Moreover, the estimated linear relationship is

$$ \hat{y} = b_0 + b_1 x $$

Under the four assumptions, we can prove that the LSE estimates $b_0$ and $b_1$ are the best estimates of the two population parameters $\beta_0$ and $\beta_1$.

Note that even when the model has passed the test, the coefficient of determination $R^2 = SSR/SST$ may still be small. That is, something is wrong with the model.
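The recap above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `simple_regression` and the five data points are hypothetical, made up for this sketch, and do not come from the lecture.

```python
import math

def simple_regression(x, y):
    """Return (r, b0, b1, t, r_squared) for paired samples x and y."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)

    r = sxy / math.sqrt(sxx * syy)              # sample correlation coefficient
    b1 = sxy / sxx                              # least squares slope
    b0 = y_bar - b1 * x_bar                     # least squares intercept
    t = r / math.sqrt((1 - r ** 2) / (n - 2))   # t statistic for H0: rho = 0, df = n - 2
    return r, b0, b1, t, r ** 2                 # in simple regression, R^2 equals r^2

# Hypothetical toy data, made up for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r, b0, b1, t, r2 = simple_regression(x, y)
print(f"r = {r:.4f}, b0 = {b0:.2f}, b1 = {b1:.2f}, t = {t:.3f}, R^2 = {r2:.2f}")
```

The computed t statistic would then be compared with the critical value for n - 2 degrees of freedom, as in the test of Question 1.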
This leads us to the following question:

Question 3 Suppose that the model has passed the test, but the coefficient of determination $R^2 = SSR/SST$ is still small.

1. Why?
2. What should we do next?

Of course, there are many possibilities. For example:

Possibility I: There is some nonlinear relationship between $x$ and $y$.

Possibility II: There are more independent variables $x_2, x_3, \ldots$ which also have a linear relationship with the dependent variable $y$.

We will focus on the second possibility here, which leads us to study multiple regression analysis and model building.

2 Multiple Regression Analysis

Multiple regression analysis is the study of how a dependent variable $y$ is related to two or more independent variables. In the general case, we will use $k$ to denote the number of independent variables.

2.1 Regression Model and Regression Equation

The concepts of a regression model and a regression equation introduced in the preceding lecture notes are applicable in the multiple regression case. The equation that describes how the dependent variable $y$ is related to the independent variables $x_1, x_2, \ldots, x_k$ and an error term is called the multiple regression model. We begin with the assumption that the multiple regression model takes the following form.

Multiple Linear Regression Model (Population Model or True Model)

$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon $$

In the multiple regression model, $\beta_0, \beta_1, \beta_2, \ldots, \beta_k$ are the parameters and the error term $\varepsilon$ is a random variable. A close examination of this model reveals that $y$ is

• a linear function of $x_1, x_2, \ldots, x_k$, namely $\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k$; and
• plus an error term $\varepsilon$.

The error term accounts for the variability in $y$ that cannot be explained by the linear effect of the $k$ independent variables.

Four assumptions similar to those that apply to the simple linear regression model must also apply to the multiple regression model.
Assumption 1 The error term $\varepsilon$ is a random variable with a mean of zero; that is,

$$ E(\varepsilon) = 0 $$

Implication: For given values of $x_1, x_2, \ldots, x_k$, the expected, or average, value of $y$ is given by

$$ E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k $$

In this equation, $E(y)$ represents the average of all possible values of $y$ that might occur for the given values of $x_1, x_2, \ldots, x_k$.

Assumption 2 The variance of the error term $\varepsilon$ is denoted by $\sigma^2$ and is the same for all values of the independent variables $x_1, x_2, \ldots, x_k$.

Implication: The variance of $y$ about the regression line equals $\sigma^2$ and is the same for all values of $x_1, x_2, \ldots, x_k$.

Assumption 3 The values of $\varepsilon$ are independent.

Implication: The value of $\varepsilon$ for a particular set of values for the independent variables is not related to the value of $\varepsilon$ for any other set of values.

Assumption 4 The error term $\varepsilon$ is a normally distributed random variable reflecting the deviation between the $y$ value and the expected value of $y$ given by

$$ \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k $$

Implication: Because $\beta_0, \beta_1, \beta_2, \ldots, \beta_k$ are constants for the given values of $x_1, x_2, \ldots, x_k$, the dependent variable $y$ is also a normally distributed random variable.

To obtain more insight about the form of the relationship given by the equation

$$ E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k $$

consider the following two-independent-variable multiple regression equation:

$$ E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 $$

The graph of this equation is a plane in three-dimensional space. The following figure provides an example of such a graph.

[Figure: regression plane in three-dimensional space for the two-independent-variable equation]

Note that the value of $\varepsilon$ shown is the difference between the actual $y$ value and the expected value of $y$, $E(y)$, when $x_1 = x_1^*$ and $x_2 = x_2^*$.

The equation that describes how the mean value of $y$ is related to $x_1, x_2, \ldots, x_k$ is called the multiple regression equation.

Multiple Linear Regression Equation

$$ E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k $$

2.2 Estimated Multiple Regression Equation

If the values of $\beta_0, \beta_1, \beta_2, \ldots, \beta_k$ were known, the above equation could be used to compute the mean value of $y$ at given values of $x_1, x_2, \ldots, x_k$. Unfortunately, these parameter values will not, in general, be known and must be estimated from sample data.
A simple random sample is used to compute sample statistics $b_0, b_1, b_2, \ldots, b_k$ that are used as the point estimators of the parameters $\beta_0, \beta_1, \beta_2, \ldots, \beta_k$. These sample statistics provide the following estimated multiple regression equation.

Estimated Multiple Regression Equation (Sample Model)

$$ \hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k $$

where
$b_0, b_1, b_2, \ldots, b_k$ are the point estimators of the parameters $\beta_0, \beta_1, \beta_2, \ldots, \beta_k$
$\hat{y}$ is the estimated value of the dependent variable

The estimation process for multiple regression is shown in the following figure.

[Figure: the estimation process for multiple regression]

2.3 Least Squares Method

In the earlier lecture notes, we used the least squares method to develop the estimated regression equation that best approximated the straight-line relationship between the dependent and independent variables. This same approach is used to develop the estimated multiple regression equation. The least squares criterion is restated as follows.

Least Squares Criterion

$$ \min_{b_0, b_1, \ldots, b_k} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_{i1} - b_2 x_{i2} - \cdots - b_k x_{ik})^2 $$

where
$y_i$ = observed value of the dependent variable for the $i$th observation
$\hat{y}_i$ = estimated value of the dependent variable for the $i$th observation $= b_0 + b_1 x_{i1} + b_2 x_{i2} + \cdots + b_k x_{ik}$

That is, the LSE $b_0, b_1, b_2, \ldots, b_k$ are the solutions to the following $k + 1$ first order conditions:

$$ \begin{cases} \dfrac{\partial \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\partial b_0} = 0 \\[2ex] \dfrac{\partial \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\partial b_1} = 0 \\ \quad \vdots \\ \dfrac{\partial \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\partial b_k} = 0 \end{cases} $$

The least squares method uses sample data to provide the values of $b_0, b_1, b_2, \ldots, b_k$ that make the sum of squared residuals [the deviations between the observed values of the dependent variable ($y_i$) and the estimated values of the dependent variable ($\hat{y}_i$)] a minimum.

In the earlier lecture notes, we presented formulas for computing the least squares estimators $b_0$ and $b_1$ for the estimated simple linear regression equation $\hat{y} = b_0 + b_1 x$. With relatively small data sets, we were able to use those formulas to compute $b_0$ and $b_1$ by manual calculations.
In multiple regression, however, the presentation of the formulas for the regression coefficients $b_0, b_1, b_2, \ldots, b_k$ involves the use of matrix algebra and is beyond the scope of this text. The emphasis will be on how to interpret the computer output rather than on how to make the multiple regression computations.

This estimated model is an extension of an estimated simple regression model. The principal difference is that whereas the estimated simple regression model is the equation for a straight line in a two-dimensional space, the estimated multiple regression model forms a hyperplane (or response surface) through multidimensional space. Each regression coefficient represents a different slope. The regression hyperplane represents the relationship between the dependent variable and the $k$ independent variables.

Example 1 The following table shows sample data for a dependent variable, $y$, and one independent variable, $x_1$.

[Table: sample data for $y$ and $x_1$]

Using one independent variable $x_1$, the LSE is the following:

$$ \hat{y} = b_0 + b_1 x_1 = 463.89 + 2.67 x_1 $$

The following figure shows a scatter plot and the regression line for the simple regression analysis for $y$ and $x_1$.

[Figure: scatter plot of $y$ against $x_1$ with the fitted regression line]

The points are plotted in two-dimensional space, and the regression model is represented by a line through the points such that the sum of squared errors

$$ SSE = \sum (y - b_0 - b_1 x_1)^2 $$

is minimized.

If we add variable $x_2$ to the model, as shown in the following table,

[Table: sample data for $y$, $x_1$, and $x_2$]

the resulting multiple regression equation becomes

$$ \hat{y} = b_0 + b_1 x_1 + b_2 x_2 = 307.71 + 2.85 x_1 + 10.94 x_2 $$

Note, however, that the $(y, x_1, x_2)$ points form a three-dimensional space, as shown in the following figure.

[Figure: the $(y, x_1, x_2)$ points and the fitted regression plane in three-dimensional space]

The regression equation forms a slice (hyperplane) through the data such that

$$ SSE = \sum (y - b_0 - b_1 x_1 - b_2 x_2)^2 $$

is minimized. This is the same least squares criterion that is used with simple linear regression.

3 Basic Model-Building Concepts

An important activity in business decision making is referred to as model building. Models are often used to test changes in a system without actually having to change the real system.
Models are also used to help describe a system or to predict the output of a system based on certain specified inputs. You are probably quite aware of physical models. Airlines use flight simulators to train pilots. Wind tunnels are used to determine the aerodynamics of automobile designs. Golf ball makers use a physical model of a golfer called Iron Mike that can be set to swing golf clubs in a very controlled manner to determine how far a golf ball will fly.

Although physical models are very useful in business decision making, our emphasis in this chapter is on statistical models that are developed using multiple regression analysis.

Modeling is both an art and a science. Determining an appropriate model is a challenging task, but it can be made manageable by employing a model-building process consisting of the following three components:

1. Model specification;
2. Model building; and
3. Model diagnosis.

Model Specification, or model identification, is the process of determining the dependent variable, deciding which independent variables should be included in the model, and obtaining the sample data for all variables. As with any statistical procedure, the larger the sample size the better, because the potential for extreme sampling error is reduced when the sample size is large. However, at a minimum, the sample size required to compute a regression model must be at least one greater than the number of independent variables. If we are thinking of developing a regression model with five independent variables, the absolute minimum number of cases required is six. Otherwise, the computer software will indicate an error has been made or will print out meaningless values. However, as a practical matter, the sample size should be at least four times the number of independent variables. Thus, if we had five independent variables ($k = 5$), we would want a sample of at least 20.

Model Building is the process of actually constructing a mathematical equation in which some or all of the independent variables are used in an attempt to explain the variation in the dependent variable.

Model Diagnosis is the process of analyzing the quality of the model you have constructed by determining how well a specified model fits the data you just gathered. You will examine output values such as R-squared and the standard error of the model. At this stage, you will also assess the extent to which the model's assumptions appear to be satisfied. If the model is unacceptable in any of these areas, you will be forced to revert to the model-specification step and begin again. However, you will be the final judge of whether the model provides acceptable results, and you will always be constrained by time and cost considerations.

You should use the simplest available model that will meet your needs. The objective of model building is to help you make better decisions. You do not need to feel that a sophisticated model is better if a simpler one will provide acceptable results.

4 An Example

In this section, we will use an example to show you how to construct a model.

Example 2 (First City Real Estate) First City Real Estate executives wish to build a model to predict sales prices for residential property. Such a model will be valuable when working with potential sellers who might list their homes with First City. This can be done using the following steps:

Step 1: Model Specification. The question being asked is how can the real estate firm determine the selling price for a house? Thus, the dependent variable is the sales price. This is what the managers want to be able to predict. The managers met in a brainstorming session to determine a list of possible independent (explanatory) variables. Some variables, such as “condition of the house,” were eliminated because of lack of data.
Others, such as “curb appeal” (the appeal of the house to people as they drive by), were eliminated because the values for these variables would be too subjective and difficult to quantify. From a wide list of possibilities, the managers selected the following variables as good candidates:

$x_1$ = Home size (in square feet)
$x_2$ = Age of house
$x_3$ = Number of bedrooms
$x_4$ = Number of bathrooms
$x_5$ = Garage size (number of cars)

Data were obtained for a sample of 319 residential properties that had sold within the previous two months in an area served by two of First City's offices. For each house in the sample, the sales price and values for each potential independent variable were collected.

Step 2: Model Building. The regression model is developed by including independent variables from among those for which you have complete data. There is no way to determine whether an independent variable will be a good predictor variable by analyzing the individual variable's descriptive statistics, such as the mean and standard deviation. Instead, we need to look at the correlation between the independent variables and the dependent variable, which is measured by the correlation coefficient.

When we have multiple independent variables and one dependent variable, we can look at the correlation between all pairs of variables by developing a correlation matrix. Each correlation is computed using one of the following equations.

Correlation Coefficient

One $x$ variable with $y$:

$$ r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}} $$

or, one $x$ variable with another $x$:

$$ r = \frac{\sum (x_i - \bar{x}_i)(x_j - \bar{x}_j)}{\sqrt{\sum (x_i - \bar{x}_i)^2 \sum (x_j - \bar{x}_j)^2}} $$

The appropriate formula is determined by whether the correlation is being calculated for an independent variable and the dependent variable or for two independent variables. The actual calculations are done using Excel's correlation tool, and the result is shown in the following figure.

[Figure: Excel correlation matrix for sales price and the five independent variables]

The output provides the correlation between $y$ and each $x$ variable and between each pair of independent variables.
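The notes compute the correlation matrix with Excel. As a rough equivalent, the sketch below applies the correlation coefficient formula to every pair of columns in plain Python. The helper names `pearson_r` and `correlation_matrix` and the three short columns of numbers are hypothetical, made up for illustration; they are not the First City sample.

```python
import math

def pearson_r(a, b):
    """Sample correlation coefficient between two equal-length lists."""
    n = len(a)
    a_bar = sum(a) / n
    b_bar = sum(b) / n
    cross = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    ss_a = sum((ai - a_bar) ** 2 for ai in a)
    ss_b = sum((bi - b_bar) ** 2 for bi in b)
    return cross / math.sqrt(ss_a * ss_b)

def correlation_matrix(columns):
    """Map every ordered pair of column names to its correlation coefficient."""
    names = list(columns)
    return {(p, q): pearson_r(columns[p], columns[q]) for p in names for q in names}

# Hypothetical toy data (NOT the First City sample), for illustration only.
data = {
    "price": [150, 200, 250, 300, 330],      # sales price in $1,000s
    "sqft": [1200, 1600, 2000, 2500, 2600],  # home size in square feet
    "age": [30, 22, 15, 8, 5],               # age of house in years
}

matrix = correlation_matrix(data)
for p in data:
    print(p.ljust(6), "  ".join(f"{matrix[(p, q)]:6.3f}" for q in data))
```

The first row (or column) of such a matrix holds the correlations of each independent variable with the dependent variable; the remaining entries are the correlations between pairs of independent variables.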
Recall that in the previous lecture notes, a $t$-test was used to test whether the correlation coefficient is statistically significant:

$$ H_0: \rho = 0, \qquad H_A: \rho \neq 0 $$

We will conduct the test with a significance level of $\alpha = 0.05$. Given degrees of freedom equal to $n - 2 = 319 - 2 = 317$, the critical value $t_{\alpha/2}$ for a two-tailed test is approximately 1.96. Any correlation coefficient generating a $t$-value greater than 1.96 or less than $-1.96$ is determined to be significant.

For now, we will focus on the correlations in the first column in the above figure, which measure the strength of the linear relationship between each independent variable and the dependent variable, sales price. For example, the $t$ statistic for price and square feet is

$$ t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}} = \frac{0.7477}{\sqrt{\dfrac{1 - 0.7477^2}{319 - 2}}} = 20.048 $$

Because

$$ t = 20.048 > t_{\alpha/2} = 1.96 $$

we reject $H_0$ and conclude that the correlation between sales price and square feet is statistically significant.

Similar calculations for the other independent variables with price show that all variables are significantly correlated with price. This indicates that a significant linear relationship exists between each independent variable and sales price.

1. Variable $x_1$, square feet, has the highest correlation at 0.748.
2. Variable $x_2$, age of the house, has the lowest correlation at $-0.485$. The negative correlation implies that older homes tend to have lower sales prices.

As we discussed in the previous lecture notes, it is always a good idea to develop scatter plots to visualize the relationship between two variables. The following figure shows the scatter plots for each independent variable and the dependent variable, sales price.

[Figure: scatter plots of sales price against each independent variable]

In each case, the plots indicate a linear relationship between the independent variable and the dependent variable. Note that several of the independent variables (bedrooms, bathrooms, garage size) are quantitative but discrete. The scatter plots for these variables show points at each level of the independent variable rather than over a continuum of values.
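The significance test for a correlation coefficient is easy to check numerically. The following sketch recomputes the $t$ statistic for price and square feet from the values given in the notes (r = 0.7477, n = 319, critical value 1.96); the function name `correlation_t_statistic` is a hypothetical helper introduced only for this illustration.

```python
import math

def correlation_t_statistic(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r / math.sqrt((1 - r ** 2) / (n - 2))

r = 0.7477          # correlation between price and square feet (from the notes)
n = 319             # sample size (from the notes)
t_critical = 1.96   # approximate two-tailed critical value at alpha = 0.05 (from the notes)

t = correlation_t_statistic(r, n)
reject_h0 = abs(t) > t_critical   # two-tailed test: reject if |t| exceeds the critical value
print(f"t = {t:.3f}, reject H0: {reject_h0}")
```

The same function can be applied to each of the other correlations in the first column of the correlation matrix, with n unchanged.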

4.1 Computing the Regression Equation

First City's goal is to develop a regression model to predict the appropriate selling price for a home, using certain measurable characteristics. The first attempt at developing the model will be to run a multiple regression computer program using all available independent variables. The regression outputs from Excel are shown in the following figure.

[Figure: Excel multiple regression output for the First City Real Estate data]

The estimate of the multiple regression model given in the above figure is

$$ \hat{y} = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + b_4 x_4 + b_5 x_5 $$
$$ = 31{,}127.6 + 63.1 \times (\text{sq. ft.}) - 1{,}144.4 \times (\text{age}) - 8{,}410.4 \times (\text{bedrooms}) + 3{,}522.0 \times (\text{bathrooms}) + 28{,}203.5 \times (\text{garage}) $$

The coefficients for each independent variable represent an estimate of the average change in the dependent variable for a 1-unit change in the independent variable, holding all other independent variables constant. For example, for houses of the same age, with the same number of bedrooms, baths, and garage size, a 1-square-foot increase in the size of the house is estimated to increase its price by an average of $63.10. Likewise, for houses with the same square footage, bedrooms, bathrooms, and garages, a 1-year increase in the age of the house is estimated to result in an average drop in sales price of $1,144.40.

The other coefficients are interpreted in the same way. Note that, in each case, we are interpreting the regression coefficient for one independent variable while holding the other variables constant.

To estimate the value of a residential property, a First City Real Estate broker would substitute values for the independent variables into the regression equation.
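The substitution step can be sketched as a small function. The coefficients below are the rounded values from the estimated equation, read as 31,127.6, 63.1, -1,144.4, -8,410.4, 3,522.0, and 28,203.5; that decimal placement is an assumption, supported by the $63.10 and $1,144.40 interpretations in the text. The names `INTERCEPT`, `COEFFICIENTS`, `predict_price`, and the dictionary keys are hypothetical, and the input values are those of the house in the worked example that follows.

```python
# Rounded coefficients from the estimated regression equation (decimal placement assumed).
INTERCEPT = 31127.6
COEFFICIENTS = {
    "sqft": 63.1,          # dollars per additional square foot
    "age": -1144.4,        # dollars per additional year of age
    "bedrooms": -8410.4,   # dollars per additional bedroom
    "bathrooms": 3522.0,   # dollars per additional bathroom
    "garage": 28203.5,     # dollars per additional garage space
}

def predict_price(house):
    """Point estimate of sales price: b0 plus the sum of b_j * x_j."""
    return INTERCEPT + sum(COEFFICIENTS[name] * house[name] for name in COEFFICIENTS)

# Characteristics of the house considered in the worked example.
house = {"sqft": 2100, "age": 15, "bedrooms": 4, "bathrooms": 3, "garage": 2}

price = predict_price(house)
print(f"Estimated sales price: ${price:,.2f}")
```

Because the coefficients here are rounded, the result can differ by a few tenths of a dollar from a figure computed by software that keeps full precision.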
For example, suppose a house with the following characteristics is considered:

$x_1$ = Square feet = 2,100
$x_2$ = Age = 15
$x_3$ = Number of bedrooms = 4
$x_4$ = Number of bathrooms = 3
$x_5$ = Size of garage = 2

The point estimate for the sales price is

$$ \hat{y} = 31{,}127.6 + 63.1 \times 2{,}100 - 1{,}144.4 \times 15 - 8{,}410.4 \times 4 + 3{,}522.0 \times 3 + 28{,}203.5 \times 2 = \$179{,}802.70 $$

(Substituting the rounded coefficients shown here gives $179,803.00; the small difference presumably reflects rounding of the coefficients.)

4.2 The Coefficient of Determination

You learned in the previous lecture notes that the coefficient of determination, $R^2$, measures the proportion of variation in the dependent variable that can be explained by the dependent variable's relationship to a single independent variable. When there are multiple independent variables in a model, $R^2$ is called the multiple coefficient of determination and is used to determine the proportion of variation in the dependent variable that is explained by the dependent variable's relationship to all the independent variables in the model.

Multiple Coefficient of Determination ($R^2$)

$$ R^2 = \frac{\text{Sum of squares regression}}{\text{Total sum of squares}} = \frac{SSR}{SST} $$

As shown in the above figure, $R^2 = 0.8161$. Both $SSR$ and $SST$ are also included in the output. We can also use $R^2 = SSR/SST$ to get $R^2$, as follows:

$$ SSR = 1.0389, \qquad SST = 1.27303, \qquad R^2 = \frac{SSR}{SST} = \frac{1.0389}{1.27303} = 0.8161 $$

More than 81% of the variation in sales price can be explained by the linear relationship of the five independent variables in the regression model to the dependent variable. However, as we shall shortly see, not all independent variables are equally important to the model's ability to explain this variation.

Should we stop here? In other words, should the manager be satisfied with the large value of $R^2$? The answer is no, since so far all we have done is point estimation. To get a satisfying result, we have to test. More specifically, before First City actually uses this regression model to estimate the sales price of a house, there are several questions that should be answered:

1. Is the overall model significant?
2. Are the individual variables significant?
3. Is the standard deviation of the model error too large to provide meaningful results?
4. Is multicollinearity a problem?
5. Have the regression analysis assumptions been satisfied?

We shall answer the five questions in the next lecture.

Practice Problems

7. Problem 15.1 on page 628. (Use alpha = 0.05 if needed)
8. Problem 15.4 on page 629.