ECON 2202: Statistical Methods in Economics and Business II
Lecture 23: Introduction to Linear Regression and Correlation Analysis (IV)
Multiple Regression Analysis and Model Building (I)
1 Introduction
So far, we have focused on two variables: one dependent variable $y$ and one independent variable $x$, and we have tried to answer the following questions:
Question 1 Is there any linear relationship between them?
We can use the sample correlation coefficient
$$ r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}} $$
to measure it.
Due to sampling error, we have to test the following hypothesis
$$ H_0: \rho = 0 \qquad H_A: \rho \neq 0 $$
using the test statistic
$$ t = \frac{r}{\sqrt{\dfrac{1-r^2}{n-2}}}, \qquad df = n - 2. $$
Question 2 After making sure that there is some linear relationship between $x$ and $y$, how do we derive it?
We use the following steps to derive it:
Step 1: Based on subject knowledge, specify which variable is the dependent variable and which is the independent variable.
Use $y$ to represent the dependent variable.
Use $x$ to represent the independent variable.
Step 2: Assume the linear relationship as follows:
$$ y = \beta_0 + \beta_1 x + \varepsilon $$
Step 3: Collect data
$$ \{(x_i, y_i) : i = 1, 2, \ldots, n\} $$
Step 4: Use the least squares estimation (LSE) to derive the estimated linear relationship
$$ \hat{y} = b_0 + b_1 x. $$
That is, choose $b_0$ and $b_1$ to minimize the total squared error:
$$ \min_{b_0, b_1} \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2. $$
After solving the two first-order conditions, the solutions are
$$ b_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad b_0 = \bar{y} - b_1 \bar{x}. $$
Step 5: Test the following hypothesis
$$ H_0: \beta_1 = 0 \qquad H_A: \beta_1 \neq 0 $$
using the test statistic
$$ t = \frac{b_1}{s_{b_1}}, \qquad s_{b_1} = \frac{s_\varepsilon}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}}, \qquad df = n - 2, $$
where $s_\varepsilon$ is the standard error of the estimate from the earlier lecture notes,
or, equivalently, test
$$ H_0: \rho^2 = 0 \qquad H_A: \rho^2 > 0 $$
using the test statistic
$$ F = \frac{MSR}{MSE}, \qquad df_1 = 2 - 1 = 1 \quad \text{and} \quad df_2 = n - 2. $$
After 0 is rejected, then there is a linear relationship between  and . Moreover, the
linear relationship is
ˆ = 0 + 1 
Under the four assumptions, we can prove that the LSE estimates 0 and 1 are the best
estimates of the two population parameters  0 and  1 
Note that even though the model has passed the test, the coefficient of determination
$$ R^2 = \frac{SSR}{SST} $$
could still be small. That is, something may still be wrong. This leads us to the following question:
Question 3 Suppose that the model has passed the test, but the coefficient of determination $R^2 = SSR/SST$ is still small.

1. Why?
2. What should we do next?
Of course, there are many possibilities. For example:
Possibility I: There is some nonlinear relationship between $x$ and $y$.
Possibility II: There are more independent variables $x_2, x_3, \ldots$, which also have a linear relationship with the dependent variable $y$.
We will focus on the second possibility here and this leads us to study multiple regression
analysis and model building.
2 Multiple Regression Analysis
Multiple regression analysis is the study of how a dependent variable $y$ is related to two or more independent variables. In the general case, we will use $k$ to denote the number of independent variables.
2.1 Regression Model and Regression Equation
The concepts of a regression model and a regression equation introduced in the preceding lecture notes are applicable in the multiple regression case. The equation that describes how the dependent variable $y$ is related to the independent variables $x_1, x_2, \ldots, x_k$ and an error term $\varepsilon$ is called the multiple regression model.
We begin with the assumption that the multiple regression model takes the following
form.
Multiple Linear Regression Model (Population Model or True Model)
$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon $$
In the multiple regression model, $\beta_0, \beta_1, \beta_2, \ldots, \beta_k$ are the parameters and the error term $\varepsilon$ is a random variable.
A close examination of this model reveals that $y$ is
• a linear function of $x_1, x_2, \ldots, x_k$, namely $\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k$;
• plus an error term $\varepsilon$.
The error term accounts for the variability in $y$ that cannot be explained by the linear effect of the $k$ independent variables.
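To see what the population model says operationally, the following sketch simulates data from it (all parameter values are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population parameters beta_0, beta_1, ..., beta_k
beta = np.array([10.0, 2.0, -1.5, 0.7])
n, k = 200, 3

X = rng.uniform(0, 10, size=(n, k))    # independent variables x1, ..., xk
eps = rng.normal(0.0, 2.0, size=n)     # error term: mean 0, constant variance
y = beta[0] + X @ beta[1:] + eps       # y = beta0 + beta1*x1 + ... + betak*xk + eps
```

Note that the simulated $\varepsilon$ satisfies the four assumptions below by construction: mean zero, constant variance, independence, and normality.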
Four assumptions similar to those that apply to the simple linear regression model must
also apply to the multiple regression model.
Assumption 1 The error term $\varepsilon$ is a random variable with mean zero; that is,
$$ E(\varepsilon) = 0. $$
Implication: For given values of $x_1, x_2, \ldots, x_k$, the expected, or average, value of $y$ is given by
$$ E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k. $$
In this equation, $E(y)$ represents the average of all possible values of $y$ that might occur for the given values of $x_1, x_2, \ldots, x_k$.
Assumption 2 The variance of the error term $\varepsilon$ is denoted by $\sigma^2$ and is the same for all values of the independent variables $x_1, x_2, \ldots, x_k$.
Implication: The variance of $y$ about the regression line equals $\sigma^2$ and is the same for all values of $x_1, x_2, \ldots, x_k$.
Assumption 3 The values of $\varepsilon$ are independent.
Implication: The value of $\varepsilon$ for a particular set of values for the independent variables is not related to the value of $\varepsilon$ for any other set of values.
Assumption 4 The error term $\varepsilon$ is a normally distributed random variable reflecting the deviation between the $y$ value and the expected value of $y$ given by
$$ \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k. $$
Implication: Because $\beta_0, \beta_1, \beta_2, \ldots, \beta_k$ are constants for the given values of $x_1, x_2, \ldots, x_k$, the dependent variable $y$ is also a normally distributed random variable.
To obtain more insight about the form of the relationship given by the equation
$$ E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k, $$
consider the following two-independent-variable multiple regression equation:
$$ E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2. $$
The graph of this equation is a plane in three-dimensional space. The following figure provides an example of such a graph.
Note that the value of $\varepsilon$ shown is the difference between the actual $y$ value and the expected value of $y$, $E(y)$, when $x_1 = x_1^*$ and $x_2 = x_2^*$.
The equation that describes how the mean value of $y$ is related to $x_1, x_2, \ldots, x_k$ is called the multiple regression equation.
Multiple Linear Regression Equation
$$ E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k $$
2.2 Estimated Multiple Regression Equation
If the values of $\beta_0, \beta_1, \beta_2, \ldots, \beta_k$ were known, the above equation could be used to compute the mean value of $y$ at given values of $x_1, x_2, \ldots, x_k$. Unfortunately, these parameter values will not, in general, be known and must be estimated from sample data. A simple random sample is used to compute sample statistics $b_0, b_1, b_2, \ldots, b_k$ that are used as the point estimators of the parameters $\beta_0, \beta_1, \beta_2, \ldots, \beta_k$. These sample statistics provide the following estimated multiple regression equation.
Estimated Multiple Linear Regression Equation (Sample Model)
$$ \hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k $$
where
$b_0, b_1, b_2, \ldots, b_k$ are the point estimators of the parameters $\beta_0, \beta_1, \beta_2, \ldots, \beta_k$;
$\hat{y}$ is the estimated value of the dependent variable.
The estimation process for multiple regression is shown in the following figure.
2.3 Least Squares Method
In the earlier lecture notes, we used the least squares method to develop the estimated regression equation that best approximated the straight-line relationship between the dependent
and independent variables. This same approach is used to develop the estimated multiple
regression equation. The least squares criterion is restated as follows.
Least Squares Criterion
$$ \min_{b_0, b_1, \ldots, b_k} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_{1i} - b_2 x_{2i} - \cdots - b_k x_{ki})^2 $$
where
$y_i$ = observed value of the dependent variable for the $i$th observation
$\hat{y}_i$ = estimated value of the dependent variable for the $i$th observation $= b_0 + b_1 x_{1i} + b_2 x_{2i} + \cdots + b_k x_{ki}$
That is, the LSE $b_0, b_1, b_2, \ldots, b_k$ are the solutions to the following $k+1$ first-order conditions:
$$ \frac{\partial}{\partial b_0} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = 0, \quad \frac{\partial}{\partial b_1} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = 0, \quad \ldots, \quad \frac{\partial}{\partial b_k} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = 0. $$
The least squares method uses sample data to provide the values of $b_0, b_1, b_2, \ldots, b_k$ that make the sum of squared residuals [the deviations between the observed values of the dependent variable ($y_i$) and the estimated values of the dependent variable ($\hat{y}_i$)] a minimum.
In the earlier lecture notes, we presented formulas for computing the least squares estimators $b_0$ and $b_1$ for the estimated simple linear regression equation $\hat{y} = b_0 + b_1 x$. With relatively small data sets, we were able to use those formulas to compute $b_0$ and $b_1$ by manual calculations. In multiple regression, however, the presentation of the formulas for the regression coefficients $b_0, b_1, b_2, \ldots, b_k$ involves the use of matrix algebra and is beyond the scope of this text.
The emphasis will be on how to interpret the computer output rather than on how to
make the multiple regression computations.
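For instance, here is one way the computation could be carried out with numpy's least squares solver, which handles the matrix algebra internally (the simulated data are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 200, 3
X = rng.uniform(0, 10, size=(n, k))                       # hypothetical predictors
y = 10 + X @ np.array([2.0, -1.5, 0.7]) + rng.normal(0, 2, size=n)

# Prepend a column of ones so the first coefficient plays the role of b0
X1 = np.column_stack([np.ones(n), X])
b, *_ = np.linalg.lstsq(X1, y, rcond=None)                # least squares solution
print("b0, b1, ..., bk =", np.round(b, 3))
```

In practice, you would rely on a statistics package (Excel, as in the examples below) and focus on interpreting its output.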
This estimated model is an extension of an estimated simple regression model. The
principal difference is that whereas the estimated simple regression model is the equation for
a straight line in a two-dimensional space, the estimated multiple regression model forms a
hyperplane (or response surface) through multidimensional space. Each regression coefficient
represents a different slope. The regression hyperplane represents the relationship between
the dependent variable and the  independent variables.
Example 1 The following table shows sample data for a dependent variable, $y$, and one independent variable, $x_1$.
Using one independent variable $x_1$, the LSE yields
$$ \hat{y} = b_0 + b_1 x_1 = 46389 + 2671\, x_1. $$
The following figure shows a scatter plot and the regression line for the simple regression analysis for $y$ and $x_1$.
The points are plotted in two-dimensional space, and the regression model is represented by
a line through the points such that the sum of squared errors
$$ \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_{1i})^2 $$
is minimized.
If we add variable $x_2$ to the model, as shown in the following table,
the resulting multiple regression equation becomes
$$ \hat{y} = b_0 + b_1 x_1 + b_2 x_2 = 30771 + 2851\, x_1 + 10942\, x_2. $$
Note, however, that the $(y, x_1, x_2)$ points lie in a three-dimensional space, as shown in the following figure.
The regression equation forms a slice (hyperplane) through the data such that
$$ \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_{1i} - b_2 x_{2i})^2 $$
is minimized. This is the same least squares criterion that is used with simple linear regression.
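A quick way to see this numerically: fit the model with $x_1$ alone and then with both $x_1$ and $x_2$, and compare the minimized sums of squared errors (the data below are hypothetical, since the example's table is not reproduced here; the SSE can only decrease when a variable is added):

```python
import numpy as np

def sse(X, y):
    """Fit y on X (with an intercept) by least squares and return the SSE."""
    X1 = np.column_stack([np.ones(len(y)), X])
    b, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return np.sum((y - X1 @ b) ** 2)

rng = np.random.default_rng(2)
x1 = rng.uniform(0, 10, 50)
x2 = rng.uniform(0, 5, 50)
y = 30 + 2.8 * x1 + 1.1 * x2 + rng.normal(0, 3, size=50)  # hypothetical data

print("SSE with x1 only:   ", round(sse(x1.reshape(-1, 1), y), 1))
print("SSE with x1 and x2: ", round(sse(np.column_stack([x1, x2]), y), 1))
```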
3 Basic Model-Building Concepts
An important activity in business decision making is referred to as model building. Models
are often used to test changes in a system without actually having to change the real system.
Models are also used to help describe a system or to predict the output of a system based
on certain specified inputs. You are probably quite aware of physical models. Airlines use
flight simulators to train pilots. Wind tunnels are used to determine the aerodynamics of
automobile designs. Golf ball makers use a physical model of a golfer called Iron Mike that
can be set to swing golf clubs in a very controlled manner to determine how far a golf ball
will fly.
Although physical models are very useful in business decision making, our emphasis in
this chapter is on statistical models that are developed using multiple regression analysis.
Modeling is both an art and a science. Determining an appropriate model is a
challenging task, but it can be made manageable by employing a model-building process
consisting of the following three components:
1. Model specification;
2. Model building; and
3. Model diagnosis.
Model Specification, or model identification, is the process of determining the dependent variable, deciding which independent variables should be included in the model, and
obtaining the sample data for all variables. As with any statistical procedure, the larger the
sample size the better, because the potential for extreme sampling error is reduced when the
sample size is large. However, at a minimum, the sample size required to compute a
regression model must be at least one greater than the number of independent
variables.
If we are thinking of developing a regression model with five independent variables, the
absolute minimum number of cases required is six. Otherwise, the computer software will
indicate an error has been made or will print out meaningless values. However, as a practical
matter, the sample size should be at least four times the number of independent variables.
Thus, if we had five independent variables ($k = 5$), we would want a sample of at least 20.
Model Building is the process of actually constructing a mathematical equation in
which some or all of the independent variables are used in an attempt to explain the variation
in the dependent variable.
Model Diagnosis is the process of analyzing the quality of the model you have constructed by determining how well a specified model fits the data you just gathered.
You will examine output values such as $R$-squared and the standard error of the model.
At this stage, you will also assess the extent to which the model’s assumptions appear to be
satisfied.
If the model is unacceptable in any of these areas, you will be forced to revert to the
model-specification step and begin again. However, you will be the final judge of whether
the model provides acceptable results, and you will always be constrained by time and cost
considerations.
You should use the simplest available model that will meet your needs. The objective
of model building is to help you make better decisions. You do not need to feel that a
sophisticated model is better if a simpler one will provide acceptable results.
We next use an example to show you how to build a model.
4 An Example
In this section, we will use an example to show you how to construct a model.
Example 2 (First City Real Estate) First City Real Estate executives wish to build a model
to predict sales prices for residential property.
Such a model will be valuable when working with potential sellers who might list their
homes with First City. This can be done using the following steps:
Step 1: Model Specification.
The question being asked is how can the real estate firm determine the selling price for
a house? Thus, the dependent variable is the sales price. This is what the managers want
to be able to predict. The managers met in a brainstorming session to determine a list of
possible independent (explanatory) variables.
Some variables, such as “condition of the house,” were eliminated because of lack of data.
Others, such as “curb appeal” (the appeal of the house to people as they drive by), were
eliminated because the values for these variables would be too subjective and difficult to
quantify.
From a wide list of possibilities, the managers selected the following variables as good
candidates:
$x_1$ = Home size (in square feet)
$x_2$ = Age of house
$x_3$ = Number of bedrooms
$x_4$ = Number of bathrooms
$x_5$ = Garage size (number of cars)
Data were obtained for a sample of 319 residential properties that had sold within the
previous two months in an area served by two of First City’s offices. For each house in the
sample, the sales price and values for each potential independent variable were collected.
Step 2: Model Building.
The regression model is developed by including independent variables from among those
for which you have complete data. There is no way to determine whether an independent
variable will be a good predictor variable by analyzing the individual variable’s descriptive
statistics, such as the mean and standard deviation. Instead, we need to look at the correlation between the independent variables and the dependent variable, which is measured by
the correlation coefficient.
When we have multiple independent variables and one dependent variable, we can look
at the correlation between all pairs of variables by developing a correlation matrix. Each
correlation is computed using one of the following equations.
Correlation Coefficient
$$ r_{yx} = \frac{\sum_{i=1}^{n}(y_i - \bar{y})(x_i - \bar{x})}{\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2 \sum_{i=1}^{n}(x_i - \bar{x})^2}} \qquad \text{(one } x \text{ variable with } y\text{)} $$
or
$$ r_{x_j x_l} = \frac{\sum_{i=1}^{n}(x_{ji} - \bar{x}_j)(x_{li} - \bar{x}_l)}{\sqrt{\sum_{i=1}^{n}(x_{ji} - \bar{x}_j)^2 \sum_{i=1}^{n}(x_{li} - \bar{x}_l)^2}} \qquad \text{(one } x \text{ variable with another } x\text{)} $$
The appropriate formula is determined by whether the correlation is being calculated for an
independent variable and the dependent variable or for two independent variables.
The actual calculations are done using Excel’s correlation tool, and the result is shown
in the following figure
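If you are not working in Excel, the same matrix can be produced with numpy; a minimal sketch with stand-in data (the actual First City sample is not reproduced here, so only three of the columns are simulated):

```python
import numpy as np

# Stand-in data: rows are houses, columns are [price, sqft, age]
rng = np.random.default_rng(3)
sqft = rng.uniform(1000, 3500, size=319)
age = rng.uniform(1, 40, size=319)
price = 30000 + 60 * sqft - 1000 * age + rng.normal(0, 20000, size=319)
data = np.column_stack([price, sqft, age])

corr = np.corrcoef(data, rowvar=False)   # correlation matrix, variables in columns
print(np.round(corr, 3))
```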
The output provides the correlation between $y$ and each $x$ variable and between each pair of independent variables. Recall that in the previous lecture notes, a $t$-test was used to test whether the correlation coefficient is statistically significant:
$$ H_0: \rho = 0 \qquad H_A: \rho \neq 0 $$
We will conduct the test at a significance level of $\alpha = 0.05$. Given degrees of freedom equal to
$$ n - 2 = 319 - 2 = 317, $$
the critical $t_{\alpha/2}$ for a two-tailed test is approximately 1.96.
Any correlation coefficient generating a $t$-value greater than 1.96 or less than $-1.96$ is determined to be significant.
For now, we will focus on the correlations in the first column in the above figure, which measure the strength of the linear relationship between each independent variable and the dependent variable, sales price. For example, the $t$ statistic for price and square feet is
$$ t = \frac{r}{\sqrt{\dfrac{1-r^2}{n-2}}} = \frac{0.7477}{\sqrt{\dfrac{1-0.7477^2}{319-2}}} = 20.048. $$
Because
$$ t = 20.048 > t_{\alpha/2} = 1.96, $$
we reject $H_0$ and conclude that the correlation between sales price and square feet is statistically significant.
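The same calculation takes a few lines of Python, using only the values reported above ($r = 0.7477$, $n = 319$):

```python
import numpy as np
from scipy import stats

r, n = 0.7477, 319                        # correlation of price with square feet
t = r / np.sqrt((1 - r ** 2) / (n - 2))   # reproduces t of about 20.05
t_crit = stats.t.ppf(0.975, df=n - 2)     # approximately 1.96
print(f"t = {t:.3f} > {t_crit:.3f}: reject H0")
```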
Similar calculations for the other independent variables with price show that all variables
are statistically correlated with price. This indicates that a significant linear relationship
exists between each independent variable and sales price.
1. Variable $x_1$, square feet, has the highest correlation at 0.748.
2. Variable $x_2$, age of the house, has the lowest correlation at $-0.485$. The negative correlation implies that older homes tend to have lower sales prices.
As we discussed in the previous lecture notes, it is always a good idea to develop scatter
plots to visualize the relationship between two variables. The following figure shows the
scatter plots for each independent variable and the dependent variable, sales price
In each case, the plots indicate a linear relationship between the independent variable and
the dependent variable.
Note that several of the independent variables (bedrooms, bathrooms, garage size) are
quantitative but discrete. The scatter plots for these variables show points at each level of
the independent variable rather than over a continuum of values.
4.1 Computing the Regression Equation
First City’s goal is to develop a regression model to predict the appropriate selling price for
a home, using certain measurable characteristics.
The first attempt at developing the model will be to run a multiple regression computer
program using all available independent variables. The regression outputs from Excel are
shown in the following figure
The estimate of the multiple regression model given in the above figure is
$$ \begin{aligned} \hat{y} &= b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + b_4 x_4 + b_5 x_5 \\ &= 31{,}127.6 + 63.1 \times \text{(sq. ft.)} - 1{,}144.4 \times \text{(age)} - 8{,}410.4 \times \text{(bedrooms)} \\ &\quad + 3{,}522.0 \times \text{(bathrooms)} + 28{,}203.5 \times \text{(garage)}. \end{aligned} $$
The coefficients for each independent variable represent an estimate of the average change
in the dependent variable for a 1-unit change in the independent variable, holding all other
independent variables constant.
For example, for houses of the same age, with the same number of bedrooms, bathrooms, and garage size, a 1-square-foot increase in the size of the house is estimated to increase its price by an average of $63.10.
Likewise, for houses with the same square footage, bedrooms, bathrooms, and garages, a 1-year increase in the age of the house is estimated to result in an average drop in sales price of $1,144.40.
The other coefficients are interpreted in the same way.
Note, in each case, we are interpreting the regression coefficient for one independent
variable while holding the other variables constant.
To estimate the value of a residential property, a First City Real Estate broker would substitute values for the independent variables into the regression equation. For example,
suppose a house with the following characteristics is considered:
$x_1$ = Square feet = 2,100
$x_2$ = Age = 15
$x_3$ = Number of bedrooms = 4
$x_4$ = Number of bathrooms = 3
$x_5$ = Size of garage = 2
The point estimate for the sales price is
$$ \hat{y} = 31{,}127.6 + 63.1 \times 2{,}100 - 1{,}144.4 \times 15 - 8{,}410.4 \times 4 + 3{,}522.0 \times 3 + 28{,}203.5 \times 2 = \$179{,}802.70. $$
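The substitution is easy to script; a sketch using the coefficients as read from the regression output above:

```python
# Coefficients b0, b1, ..., b5 as read from the First City regression output
b = [31127.6, 63.1, -1144.4, -8410.4, 3522.0, 28203.5]
x = [2100, 15, 4, 3, 2]   # sqft, age, bedrooms, bathrooms, garage size

y_hat = b[0] + sum(bi * xi for bi, xi in zip(b[1:], x))
print(f"estimated sales price: ${y_hat:,.2f}")   # about $179,803
```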
4.2 The Coefficient of Determination
You learned in the previous lecture notes that the coefficient of determination, $R^2$, measures the proportion of variation in the dependent variable that can be explained by the dependent variable's relationship to a single independent variable. When there are multiple independent variables in a model, $R^2$ is called the multiple coefficient of determination and is used to determine the proportion of variation in the dependent variable that is explained by the dependent variable's relationship to all the independent variables in the model.
Multiple Coefficient of Determination ($R^2$)
$$ R^2 = \frac{\text{Sum of squares regression}}{\text{Total sum of squares}} = \frac{SSR}{SST} $$
As shown in the above figure,
$$ R^2 = 0.8161. $$
Both $SSR$ and $SST$ are also included in the output.
We can also use
$$ R^2 = \frac{SSR}{SST} $$
to get $R^2$, as follows:
$$ SSR = 1.0389 \times 10^{12}, \qquad SST = 1.27303 \times 10^{12}, $$
$$ R^2 = \frac{SSR}{SST} = \frac{1.0389 \times 10^{12}}{1.27303 \times 10^{12}} = 0.8161. $$
More than 81% of the variation in sales price can be explained by the linear relationship of
the five independent variables in the regression model to the dependent variable. However,
as we shall shortly see, not all independent variables are equally important to the model’s
ability to explain this variation.
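A small helper for computing $R^2$ from a fitted model, under the usual convention that the model includes an intercept (so that $SSR/SST = 1 - SSE/SST$); the arrays y and y_hat are placeholders for observed and fitted values:

```python
import numpy as np

def r_squared(y, y_hat):
    """Multiple coefficient of determination: R^2 = SSR/SST = 1 - SSE/SST."""
    sst = np.sum((y - np.mean(y)) ** 2)   # total sum of squares
    sse = np.sum((y - y_hat) ** 2)        # sum of squares error
    return 1.0 - sse / sst
```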
Should we stop here? In other words, should the manager be satisfied with the large value of $R^2$?
Of course, the answer is no, since so far what we have done is just point estimation. To validate the result, we have to conduct hypothesis tests.
More specifically, before First City actually uses this regression model to estimate the
sales price of a house, there are several questions that should be answered:
1. Is the overall model significant?
2. Are the individual variables significant?
3. Is the standard deviation of the model error too large to provide meaningful
results?
4. Is multicollinearity a problem?
5. Have the regression analysis assumptions been satisfied?
We shall answer the five questions in the next lecture.
Practice Problems
7. Problem 15.1 on page 628. (Use $\alpha = 0.05$ if needed.)
8. Problem 15.4 on page 629.