Lab 5: Multiple regression and non-linear estimation (avoiding the polynomial)
10/9/2011

Overview
• Multiple Linear Regression
• Non-Linear Regression
– Polynomials
– Specific curves

Sample Commands
Each command is followed by its description.

graph matrix x1 x2 x3 y, half
    Draws a scatterplot matrix of the listed variables (half shows only the lower triangle).
regress y x1 x2 x3
    Performs ordinary least squares (OLS) regression of variable y on the independent variables x1, x2, and x3.
predict newvar
    Places the predicted values of y from the most recent regression in the new variable newvar (for example, yhat).
predict newvar, resid
    Places the residuals from the most recent regression in the new variable newvar (for example, e).
predict newvar, rstandard
    Places the standardized residuals in the variable newvar.
predict newvar, rstudent
    Places the studentized residuals in the variable newvar.
predict newvar, hat
    Places the leverage values in the variable newvar.
predict newvar, cooksd
    Places the Cook’s D influence measures in the variable newvar.
predict newvar, dfits
    Places the DFITS influence measures in the variable newvar.
nl exp2: y x
    Uses iterative nonlinear least squares to fit a 2-parameter exponential growth model, y = b1*b2^x (the built-in function exp2).
nl (weight = ({b0} * age)/({b1} + age)), initial(b0 50 b1 20)
    Uses iterative nonlinear least squares to fit a custom model (in this case a Michaelis-Menten equation) with specified initial parameter values.
rvfplot
    Draws a residual-versus-fitted (predicted values) plot, automatically based on the most recent regression.
rvpplot x1
    Graphs the residuals against the values of the predictor variable x1.
graph twoway scatter e yhat, yline(0)
    Draws a residual-versus-predicted-values plot using the variables e and yhat. A horizontal line is drawn at y = 0.
hettest
    Performs Cook and Weisberg’s test for heteroskedasticity.

Multiple Regression
• graph matrix latitude elevation antdensity, half
• regress antdensity latitude elevation
• rvfplot, yline(0)

Assessing fit
• Residuals
• Potential outliers
• Leverage points
• Influential points
• Homogeneity of residual variance
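
The items in the Assessing fit list map fairly directly onto predict options. The do-file sketch below is one way to pull them together. It assumes Stata's shipped auto dataset as a stand-in (price, mpg, and weight are not the lab's variables), takes k to mean the number of predictors, and uses the cutoffs quoted in this lab's evaluation questions; the variable and scalar names are arbitrary.

```stata
* Stand-in data: Stata's shipped auto dataset (an assumption; not the lab's ant data)
sysuse auto, clear
regress price mpg weight

* One predict call per diagnostic; the new variable names are arbitrary
predict double rstu, rstudent    // studentized residuals (potential outliers)
predict double lev, hat          // leverage
predict double dfit, dfits       // DFFITS influence measure
predict double cook, cooksd      // Cook's D influence measure

* k = number of predictors (assumed meaning of k), n = observations used
scalar k_pred = e(df_m)
scalar n_obs  = e(N)

* Flag observations using the cutoffs quoted in the evaluation questions
generate byte flag_out = abs(rstu) > 2 if !missing(rstu)
generate byte flag_lev = lev > (2*scalar(k_pred) + 2)/scalar(n_obs) if !missing(lev)
generate byte flag_inf = abs(dfit) > 2*sqrt(scalar(k_pred)/scalar(n_obs)) if !missing(dfit)

list make rstu lev dfit cook if flag_out == 1 | flag_lev == 1 | flag_inf == 1

* Homogeneity of residual variance: Breusch-Pagan / Cook-Weisberg test
estat hettest
```

A flagged observation is a prompt to look at that point more closely, not an automatic reason to drop it.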

Variance of residuals
• hettest

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
         Ho: Constant variance
         Variables: fitted values of antdensity

         chi2(1)      =     3.70
         Prob > chi2  =   0.0544

Multicollinearity
• vif

    Variable |       VIF       1/VIF
-------------+----------------------
   elevation |      1.03    0.968050
    latitude |      1.03    0.968050
-------------+----------------------
    Mean VIF |      1.03

Multicollinearity
• collin elevation latitude

  Collinearity Diagnostics

                       SQRT                  R-
  Variable     VIF     VIF    Tolerance   Squared
----------------------------------------------------
  elevation    1.03    1.02    0.9681     0.0319
  latitude     1.03    1.02    0.9681     0.0319
----------------------------------------------------
  Mean VIF     1.03

                       Cond
        Eigenval      Index
---------------------------------
    1     2.7708      1.0000
    2     0.2289      3.4792
    3     0.0003     97.1188
---------------------------------
  Condition Number   97.1188
  Eigenvalues & Cond Index computed from scaled raw sscp (w/ intercept)
  Det(correlation matrix)    0.9681
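
The columns of the collin output are tied together by simple identities, which gives a quick consistency check on the numbers above. Here R_j^2 is the R-squared from regressing predictor j on the other predictor(s), and the lambdas are the eigenvalues in the second table; the symbols are notation introduced for this check, not part of the Stata output.

```latex
\[
\text{Tolerance}_j = 1 - R_j^2 = 1 - 0.0319 = 0.9681,
\qquad
\mathrm{VIF}_j = \frac{1}{\text{Tolerance}_j} = \frac{1}{0.9681} \approx 1.03,
\qquad
\sqrt{\mathrm{VIF}_j} \approx 1.02 .
\]
\[
\text{Cond Index}_i = \sqrt{\frac{\lambda_{\max}}{\lambda_i}},
\qquad
\sqrt{\frac{2.7708}{0.2289}} \approx 3.479 .
\]
```

VIF values this close to 1 indicate that elevation and latitude are nearly uncorrelated with each other.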

Evaluation
• Write out the equation for the multiple regression that you have just fit.
• Are there influential points (i.e., ABS(DFFITS) > 2*sqrt(k/n))?
• Are there points with high leverage (leverage [H] greater than (2k+2)/n) that you should be concerned about?
• Do the studentized residuals identify any potential outliers (beyond ±2)?
• Is the response variable (ANTDENSITY) linearly related to both independent variables?
• What do you conclude about the multiple regression?

Nonlinear Regression
• We are going to attempt to fit two curves to these data. The first, a quadratic (2nd-order polynomial), will have the form
Y = a + b*X + c*X^2
• The second will be a Michaelis-Menten equation with an intercept:
Y = a + (c*X)/(d+X)

Interpreting parameters
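
One way to read the parameters of the Michaelis-Menten curve with an intercept, Y = a + (c*X)/(d+X), is through its values at a few special points. This is a sketch that assumes d > 0 and X >= 0:

```latex
\[
Y(0) = a,
\qquad
\lim_{X \to \infty} Y(X) = a + c,
\qquad
Y(d) = a + \frac{c\,d}{d + d} = a + \frac{c}{2} .
\]
```

Under those assumptions, a is the value of Y at X = 0, c is the total rise from a up to the asymptote, and d is the value of X at which half of that rise has been reached. For the quadratic Y = a + b*X + c*X^2, a is likewise the value at X = 0, b is the slope at X = 0, and c sets the curvature (the slope changes by 2c per unit of X).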

Fitting parameters

Equation                 Parameter    Initial Value
Y = a + b*X + c*X^2      a              2
                         b              1
                         c             -0.1
Y = a + (c*X)/(d+X)      a              5
                         c             50
                         d             35

Fitting the curve
• nl (weight = {a} + {b}*age + {c}*age^2)
• nl (weight = {b0} + {b1}*age + {b2}*age^2), initial(b0 2 b1 1 b2 -1)
• nl (weight = {a} + ({c} * age)/({d} + age)), initial(a 10 c 50 d 20)

‘Automatic’ commands don’t work
• Note that you will have to use predict commands to create the residuals and predicted values for plotting, because commands like rvpplot, rvfplot, and hettest do not work after the nl command.
• For example, to view a plot of residuals versus predicted values, you could use the following commands:
• predict residuals, resid
• predict yhat
• scatter residuals yhat

Evaluation
• In completing your assessment of the polynomial and Michaelis-Menten equations you should be able to answer the following questions:
• What is the equation for the polynomial that you fit to the data?
• What are your observations about the appropriateness of the polynomial model (fit, residuals, etc.)?
• Write out the Michaelis-Menten equation that you fit to the data.
• What are your observations about the appropriateness of the Michaelis-Menten model (fit, residuals, etc.)?
• Based on fit (estimates of R^2, although you can simply compare the regression SS because the total SS will be the same), which model is best? Is the statistically ‘best’ model the one that you would use (i.e., does it have the best biological interpretation)? Why or why not?
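
The fit comparison in the last question can be sketched in a do-file as follows. This is a sketch only: it uses simulated data as a hypothetical stand-in for the lab's weight and age variables (the generating values 5, 50, 35, the noise level, and the seed are all assumptions), and it compares residual sums of squares computed by hand from the predict residuals, with R^2 taken as 1 - RSS/TSS.

```stata
* Hypothetical simulated data standing in for the lab's weight-age data
clear
set seed 12345
set obs 100
generate double age = 2*_n
generate double weight = 5 + (50*age)/(35 + age) + rnormal(0, 1)

* Total sum of squares about the mean (the same for both models)
quietly summarize weight
scalar tss = r(Var)*(r(N) - 1)

* Quadratic polynomial, starting values from the Fitting parameters table
nl (weight = {a} + {b}*age + {c}*age^2), initial(a 2 b 1 c -0.1)
predict double res_quad, residuals
generate double res_quad_sq = res_quad^2
quietly summarize res_quad_sq
scalar rss_quad = r(sum)

* Michaelis-Menten with an intercept, starting values from the same table
nl (weight = {a} + ({c}*age)/({d} + age)), initial(a 5 c 50 d 35)
predict double res_mm, residuals
generate double res_mm_sq = res_mm^2
quietly summarize res_mm_sq
scalar rss_mm = r(sum)

display "Quadratic:        RSS = " scalar(rss_quad) "  R^2 = " 1 - scalar(rss_quad)/scalar(tss)
display "Michaelis-Menten: RSS = " scalar(rss_mm) "  R^2 = " 1 - scalar(rss_mm)/scalar(tss)
```

Because both models are fit to the same response, the one with the smaller residual SS also has the larger R^2. Whether that model is also the one with the more sensible biological interpretation is a separate judgment.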