Multiple Regression STAT E-150 Statistical Methods

Three percent of a man's body is essential fat, which is necessary for a
healthy body. However, too much body fat can be dangerous. For men
between the ages of 18 and 39, a healthy body fat percent is 8% to
19%. (For women it is 21% to 32%.)
It is not easy to measure body fat percent directly, but we can find a model for
the relationship between body fat percent and waist size and use it to
estimate the body fat percent associated with a given waist size.
The scatterplot indicates a positive linear relationship between waist size
and body fat percent:
The SPSS output shows a significant linear relationship between the two
variables.
Coefficients (a)

Model 1       B         Std. Error   Beta    t         Sig.
(Constant)    -42.734   2.717                -15.731   .000
Waist           1.700    .074        .824     22.875   .000

B and Std. Error are the unstandardized coefficients; Beta is the standardized coefficient.
a. Dependent Variable: Pct BF

Model Summary

Model   R      R Square   Adjusted R Square   Std. Error of the Estimate
1       .824   .678       .677                4.7126
R² = .678, so we know that almost 68% of the variability in body fat
percentage is accounted for by waist size.
What other variables might be used to predict body fat percentage?
Can we improve the prediction by including additional variables?
The Multiple Linear Regression Model
We have n observations on k explanatory variables X1, X2, X3, …, Xk and
a response variable, Y. The multiple regression model is:
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + \varepsilon$$
where ε ~ N(0, σε) and the errors are independent of one another.
The predictor variables may be higher powers or other functions of
quantitative variables, coded categorical variables, or interaction terms.
The main restriction is that the model is linear in the coefficients; that is,
each term is a constant multiple of a predictor.
Fitting a Multiple Linear Regression Model
As we did in Simple Linear Regression, we will choose a possible set of
predictors, estimate the coefficients based on sample data, and assess
the fit. We will again use the sum of squared residuals, where the
residuals are the differences between the actual Y values and the Y
values predicted by the prediction equation
$$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \cdots + \hat{\beta}_k X_k$$
and use SPSS to determine the estimates of the coefficients βi that
minimize the sum of the squared residuals.
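As a rough illustration of what "minimize the sum of the squared residuals" means, here is a minimal Python sketch that solves the least-squares normal equations directly. It is not how SPSS is implemented; the function names and the small data set are made up for this example, and the data are generated without noise from y = 2 + 3·x1 − x2 so that the recovered coefficients can be checked by eye.

```python
def fit_least_squares(rows, y):
    """Return [b0, b1, ..., bk] that minimize the sum of squared residuals."""
    # Design matrix with a leading 1 in each row for the intercept.
    X = [[1.0] + [float(v) for v in r] for r in rows]
    p = len(X[0])
    # Normal equations: (X'X) b = X'y
    A = [[sum(x[i] * x[j] for x in X) for j in range(p)] for i in range(p)]
    c = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(p)]
    # Gaussian elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        c[col], c[pivot] = c[pivot], c[col]
        for r in range(col + 1, p):
            factor = A[r][col] / A[col][col]
            for j in range(col, p):
                A[r][j] -= factor * A[col][j]
            c[r] -= factor * c[col]
    # Back substitution.
    b = [0.0] * p
    for i in range(p - 1, -1, -1):
        s = sum(A[i][j] * b[j] for j in range(i + 1, p))
        b[i] = (c[i] - s) / A[i][i]
    return b


def sum_squared_residuals(rows, y, b):
    """Sum of (actual y - predicted y)^2 for coefficients b."""
    total = 0.0
    for r, yi in zip(rows, y):
        predicted = b[0] + sum(bj * xj for bj, xj in zip(b[1:], r))
        total += (yi - predicted) ** 2
    return total


# Hypothetical, noise-free data generated from y = 2 + 3*x1 - 1*x2.
rows = [(1, 2), (2, 1), (3, 5), (4, 3), (5, 8), (6, 2)]
y = [2 + 3 * x1 - 1 * x2 for x1, x2 in rows]

coefs = fit_least_squares(rows, y)
sse = sum_squared_residuals(rows, y, coefs)
print([round(b, 6) for b in coefs], round(sse, 6))
```

With real data the residuals would not all be zero; the fitted coefficients are simply the ones that make the sum of squared residuals as small as possible.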
We will test the hypotheses
H0: β1 = β2 = β3 = … = βk = 0
Ha: The slopes are not all zero.
Our assumptions are:
- The y-values are independent of each other
- Y has a constant variance for any combination of predictors
- The values of y are normally distributed for any fixed set of
values for the explanatory variables
That is, the errors are independent values from a N(0, σε)
distribution.
If the null hypothesis is rejected, then test a null hypothesis for
each of the coefficients:
H0: βj = 0
Ha: βj ≠ 0
Note: If the null hypothesis for a coefficient is not rejected, it does not
mean that the corresponding predictor variable has no relationship to y; it
means that we have no evidence that the predictor contributes to modeling
y after allowing for all the other predictors.
The hypotheses for fitting a multiple linear regression model to predict
body fat percentage based on waist size and height are
H0: βwaist = βheight = 0
Ha: The slopes are not both zero.
Here are the scatterplots using the individual predictors:
Although this suggests a linear relationship between waist size and body
fat percentage, there doesn't appear to be a linear relationship between
height and body fat percentage.
Here are some of the results for a multiple regression analysis with both
height and waist as predictors:
Coefficients (a)

Model 1       B        Std. Error   Beta    t        Sig.
(Constant)    -3.110   7.687                 -.405   .686
Waist          1.773    .072         .859   24.768   .000
Height         -.601    .110        -.190   -5.470   .000

B and Std. Error are the unstandardized coefficients; Beta is the standardized coefficient.
a. Dependent Variable: Pct BF

ANOVA (a)

Model 1       Sum of Squares   df    Mean Square   F         Sig.
Regression    12216.077          2   6108.038      307.096   .000 (b)
Residual       4912.743        247     19.890
Total         17128.820        249

a. Dependent Variable: Pct BF
b. Predictors: (Constant), Height, Waist
The p-value for height is close to 0, so we have evidence that height does
contribute to the multiple regression model once waist size is included.
The graph shown below is called a scatterplot matrix. It shows the
scatterplots for all pairs of the variables we are using.
Which pair of variables shows a strong linear relationship?
Pct BF and Waist
Which pair of variables shows a weak linear relationship?
Height and Waist
Which pair of variables shows no linear relationship?
Pct BF and Height
Residual Analysis
These plots tell us that there is no particular pattern in the scatter of the
residuals, and that the distribution of the residuals is close to normal.
Use the SPSS output provided to answer the questions below:

Coefficients (a)

Model 1       B        Std. Error   Beta    t        Sig.
(Constant)    -3.110   7.687                 -.405   .686
Waist          1.773    .072         .859   24.768   .000
Height         -.601    .110        -.190   -5.470   .000

a. Dependent Variable: Pct BF

Model Summary (b)

Model   R          R Square   Adjusted R Square   Std. Error of the Estimate
1       .845 (a)   .713       .711                4.4598

a. Predictors: (Constant), Height, Waist
b. Dependent Variable: Pct BF

What is the fitted regression equation?
%BodyFat = 1.773 waist - .601 height - 3.110
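To make the fitted equation concrete, the short Python sketch below evaluates it for one hypothetical man. The function name and the example measurements (a 38-inch waist and a height of 70 inches) are assumptions chosen for illustration; only the three coefficients come from the SPSS output.

```python
def predict_pct_body_fat(waist_in, height_in):
    """Predicted body fat percentage from the fitted equation.

    Coefficients are taken from the SPSS Coefficients table;
    both measurements are assumed to be in inches.
    """
    return 1.773 * waist_in - 0.601 * height_in - 3.110


# Hypothetical example: 38-inch waist, 70 inches tall.
example = predict_pct_body_fat(38, 70)
print(round(example, 3))  # 22.194

# Holding height fixed, one more inch of waist changes the prediction
# by exactly the waist coefficient.
waist_effect = predict_pct_body_fat(39, 70) - predict_pct_body_fat(38, 70)
print(round(waist_effect, 3))  # 1.773
```

The second calculation previews how each slope is interpreted: it is the change in the prediction per unit of one predictor while the other predictor is held constant.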
What does the value 1.773 tell you? An increase of one inch in the waist
measurement is associated with an increase of 1.773 percentage points in body
fat percentage, for men of a particular height.
What change in body fat percentage is associated with each additional
inch of height? An increase of one inch in height is associated with a
decrease of .601 percentage points in body fat percentage, for men of a
particular waist size.
What is the value of R²? What does it tell you? R² = .713, which tells us
that height and waist size together account for about 71.3% of the
variation in body fat percentage for men.
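The value of R² can be checked against the sums of squares in the SPSS ANOVA table, since R² is the regression sum of squares divided by the total sum of squares. The following Python lines are a small sketch of that arithmetic; the variable names are illustrative.

```python
# Sums of squares from the SPSS ANOVA table for the two-predictor model.
ss_regression = 12216.077
ss_total = 17128.820

# R squared: the share of total variation accounted for by the model.
r_squared = ss_regression / ss_total
print(round(r_squared, 3))  # 0.713
```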
Use the SPSS results to complete the hypothesis test:

ANOVA (a)

Model 1       Sum of Squares   df    Mean Square   F         Sig.
Regression    12216.077          2   6108.038      307.096   .000 (b)
Residual       4912.743        247     19.890
Total         17128.820        249

a. Dependent Variable: Pct BF
b. Predictors: (Constant), Height, Waist

The value of the test statistic is F = 307.096
p = 0+ (that is, slightly greater than 0; SPSS displays it as .000)
What can you conclude? Since p is close to zero, the null hypothesis is
rejected. These data indicate that there is a linear relationship between
body fat percentage and the predictor variables Waist and Height.
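The F statistic in the ANOVA table is the regression mean square divided by the residual mean square, and each mean square is a sum of squares divided by its degrees of freedom. The Python sketch below reproduces that arithmetic from the table entries; the variable names are illustrative, and the p-value is left to SPSS.

```python
# Entries from the SPSS ANOVA table.
ss_regression, df_regression = 12216.077, 2
ss_residual, df_residual = 4912.743, 247

# Mean square = sum of squares / degrees of freedom.
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

# F statistic for H0: all slopes are zero.
f_statistic = ms_regression / ms_residual
print(round(ms_regression, 3), round(ms_residual, 3), round(f_statistic, 2))
```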
We also want to estimate the standard deviation of the error term, σε.
As we add a new predictor to the model, we have a new coefficient to
estimate, and so we lose one more degree of freedom.
The estimate for the standard error of the multiple regression model with
k predictors is
$$\hat{\sigma}_\varepsilon = \sqrt{\frac{SSE}{n - k - 1}}$$
Use the SPSS output to find the standard error of this regression model:

ANOVA (a)

Model 1       Sum of Squares   df    Mean Square   F         Sig.
Regression    12216.077          2   6108.038      307.096   .000 (b)
Residual       4912.743        247     19.890
Total         17128.820        249

a. Dependent Variable: Pct BF
b. Predictors: (Constant), Height, Waist

$$\hat{\sigma}_\varepsilon = \sqrt{\frac{SSE}{n - k - 1}} = \sqrt{\frac{4912.743}{247}} = \sqrt{19.8896} = 4.4598$$
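The same calculation can be sketched in a few lines of Python. Here n = 250 is inferred from the total degrees of freedom (249 = n − 1) in the ANOVA table, and k = 2 is the number of predictors; the variable names are illustrative.

```python
import math

sse = 4912.743   # residual sum of squares from the ANOVA table
n = 250          # sample size, inferred from total df = n - 1 = 249
k = 2            # number of predictors (Waist and Height)

# Standard error of the regression: sqrt(SSE / (n - k - 1)).
df_residual = n - k - 1
sigma_hat = math.sqrt(sse / df_residual)
print(df_residual, round(sigma_hat, 4))
```

The result agrees with the "Std. Error of the Estimate" that SPSS reports in the Model Summary table.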
Assessing a Multiple Regression Model
Individual t-Tests for Coefficients in Multiple Regression
In order to determine whether any one of the predictor variables is helpful
to include in the model, we test the coefficient for that predictor:
H0: βi = 0
Ha: βi ≠ 0
The test statistic is
$$t = \frac{\hat{\beta}_i - 0}{SE(\hat{\beta}_i)}$$
with n - k - 1 degrees of freedom.
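As a quick check of this formula, the Python sketch below divides each coefficient in the SPSS output for the waist and height model by its standard error. The helper name is made up for this example. Because the table shows rounded coefficients and standard errors, the ratios come out slightly different from the t values SPSS prints (-5.470 and 24.768), which are presumably computed from unrounded values.

```python
def t_statistic(estimate, std_error, hypothesized=0.0):
    """t = (estimate - hypothesized value) / standard error."""
    return (estimate - hypothesized) / std_error


# Rounded values from the SPSS Coefficients table.
t_height = t_statistic(-0.601, 0.110)
t_waist = t_statistic(1.773, 0.072)

# Degrees of freedom: n - k - 1 with n = 250 men and k = 2 predictors.
df = 250 - 2 - 1
print(round(t_height, 3), round(t_waist, 3), df)
```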
It is important to remember that the meaning of each coefficient depends
on all of the predictors in the regression model.
If we fail to reject the null hypothesis, it means that we have no evidence
that the corresponding predictor variable contributes to the multiple
regression model after allowing for all other predictors.
Use the SPSS output to test the coefficients in our model:

Coefficients (a)

Model 1       B        Std. Error   Beta    t        Sig.
(Constant)    -3.110   7.687                 -.405   .686
Waist          1.773    .072         .859   24.768   .000
Height         -.601    .110        -.190   -5.470   .000

a. Dependent Variable: Pct BF
H0: βheight = 0
Ha: βheight ≠ 0
t = -5.470
p = 0+
What is your conclusion?
Since p is close to 0, we reject the null hypothesis.
There is evidence that body fat percentage is related to height
after allowing for waist size: among men with the same waist size,
body fat percentage changes as height changes.
H0: βwaist = 0
Ha: βwaist ≠ 0
t = 24.768
p = 0+
What is your conclusion?
Since p is close to 0, we reject the null hypothesis.
There is evidence that body fat percentage is related to waist size
after allowing for height: among men of the same height, body fat
percentage changes as waist size changes.
Can we do a one-tailed test?
H0: βwaist = 0
Ha: βwaist > 0
t = 24.768
p = .000/2 = 0+
The one-tailed p-value is half of the two-tailed Sig. value because the sign
of t agrees with the direction of Ha.
What is your conclusion?
Since p is close to 0, we reject the null hypothesis.
There is evidence that body fat percentage is positively related to
waist size.
We can conclude that the body fat percentage increases as the
waist size increases, for men of the same height.
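The halving rule for a one-tailed test only applies when the sign of t points in the direction of the alternative hypothesis; otherwise the one-tailed p-value is one minus half of the two-tailed value. The Python helper below is a minimal sketch of that rule. The function name and its arguments are made up for this example, and .0005 is used only as an upper bound for a Sig. value that SPSS displays as .000.

```python
def one_tailed_p(two_tailed_p, t, alternative):
    """Convert a two-tailed p-value from a t-test to a one-tailed p-value.

    alternative is "greater" (Ha: beta > 0) or "less" (Ha: beta < 0).
    """
    if alternative not in ("greater", "less"):
        raise ValueError("alternative must be 'greater' or 'less'")
    # Does the observed t point in the direction claimed by Ha?
    agrees = t > 0 if alternative == "greater" else t < 0
    return two_tailed_p / 2 if agrees else 1 - two_tailed_p / 2


# Waist: t = 24.768 and Sig. shown as .000, so two-tailed p is below .0005.
p_waist_upper_bound = one_tailed_p(0.0005, 24.768, "greater")

# Constant: t = -.405, Sig. = .686. Testing Ha: beta > 0 would not halve p.
p_constant = one_tailed_p(0.686, -0.405, "greater")
print(p_waist_upper_bound, round(p_constant, 3))
```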
Adjusted R²
The adjusted R² is an adjustment to R² that takes the sample size and
the number of parameters (βj) into consideration.
R² never decreases as more predictors are added to the model, but the
adjusted R² can decrease when an added predictor contributes little,
and so it can be useful in comparing regression models with different
numbers of predictor variables.
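One standard way to write the adjustment is adjusted R² = 1 − (1 − R²)(n − 1)/(n − k − 1). The Python sketch below applies it to the waist and height model, with n = 250 inferred from the total degrees of freedom in the ANOVA table. The function name is illustrative, and the three-predictor comparison is purely hypothetical: it shows what would happen if a third predictor were added without raising R² at all.

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)


# Waist and Height model: R^2 = .713, n = 250 men, k = 2 predictors.
adj_two_predictors = adjusted_r_squared(0.713, 250, 2)

# Hypothetical: a third predictor that leaves R^2 unchanged at .713.
adj_three_predictors = adjusted_r_squared(0.713, 250, 3)

print(round(adj_two_predictors, 3), round(adj_three_predictors, 3))
```

The first value matches the Adjusted R Square that SPSS reports (.711); the second is smaller, which is the penalty for a predictor that adds nothing.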
Creating a Scatterplot Matrix
Click on Graphs > Chart Builder.
Select Scatter/Dot from the list of charts.
Drag the Scatterplot Matrix to the window.
Drag the matrix variables to the horizontal axis.
Click on OK.
The scatterplot matrix will appear in the Output Viewer.
Estimating the Model
Click on Analyze > Regression > Linear
Drag the dependent variable and all independent variables
to the appropriate locations.
Click on OK.
This will produce several tables:
Model Summary

Model   R          R Square   Adjusted R Square   Std. Error of the Estimate
1       .845 (a)   .713       .711                4.4598

a. Predictors: (Constant), Waist, Height

Coefficients (a)

Model 1       B        Std. Error   Beta    t        Sig.
(Constant)    -3.110   7.687                 -.405   .686
Height         -.601    .110        -.190   -5.470   .000
Waist          1.773    .072         .859   24.768   .000

a. Dependent Variable: Pct BF

ANOVA (b)

Model 1       Sum of Squares   df    Mean Square   F         Sig.
Regression    12216.077          2   6108.038      307.096   .000 (a)
Residual       4912.743        247     19.890
Total         17128.820        249

a. Predictors: (Constant), Waist, Height
b. Dependent Variable: Pct BF
If you click on Plots in the Linear Regression dialog box, you will get this
dialog box:
Plot the *ZRESID values on the Y axis against the *ZPRED values on the X axis.
You may also choose to create a Normal Probability Plot and/or histogram of
the residuals.
Click on Continue and then OK. Here are the results: