Review of Statistical Models and Linear Regression Concepts STAT E-150 Statistical Methods

Transcription

Statistical Models are used to make predictions, understand
relationships, and assess differences.
A statistical model can be written as
Data = model + error, or Y = f(x) + ε
where Y is the response variable
x is the explanatory variable
ε is the error
The error term, ε, represents the part of the response variable that is
not explained by its relationship to the predictor variable. We often
consider the probability distribution of this error term as part of our
assessment of the model.
The Four-Step Process for statistical modeling:
1. Choose a form for the model
Identify the variables and their types
Examine graphs to help identify the appropriate model
2. Fit the model to the data
Use the sample data to estimate the values of the model
parameters
3. Assess how well the model fits the data
Compare models
Examine the residuals
4. Use the model to make predictions, explain relationships,
assess differences
The appropriate model depends on the type of variables and the role
each variable plays in the analysis.
Example:
Medical researchers have noted that adolescent females are more likely
to deliver low-birthweight babies than are adult females. Because LBW
babies tend to have higher mortality rates, studies have been conducted
to examine the relationship between birthweight and the mother’s age.
One such study is discussed in the article “Body Size and Intelligence in
6-Year-Olds: Are Offspring of Teenage Mothers at Risk?” (Maternal and
Child Health Journal [2009], pp. 847-856.)
The following data is consistent with summary values given in the article,
and with data published by the National Center for Health Statistics:
Observation   Maternal Age (in years)   Birthweight (in grams)
 1            15                        2289
 2            17                        3393
 3            18                        3271
 4            15                        2648
 5            16                        2897
 6            19                        3327
 7            17                        2970
 8            16                        2535
 9            18                        3138
10            19                        3573
What are the observational units?
Teenage mothers and their babies
Which is the response variable?
The baby’s weight (in grams)
Which is the explanatory variable?
The mother’s age (in years)
Simple Linear Regression is used to investigate whether there is a
linear relationship between two quantitative variables. If a linear
relationship exists, we can create a model for the relationship, and use
this model to answer these questions:
What is the relationship between the variables?
What does the slope of this linear model tell us?
When is it appropriate to use this linear model to make predictions?
A First-Order Linear Model is of the form
y = β0 + β1x + ε
where
y = the response variable
x = the independent, or predictor, or explanatory variable
ε = the random error
β0 = where the regression line crosses the y-axis;
the y-intercept of the regression line is the point (0, β0)
β1 = the slope of the regression line
   = (change in y) / (change in x)
   = the change in y for every unit increase in x
y = β1x + β0
Steps in regression
1. Hypothesize the form of the model for E(y), the mean or
expected value of y
2. Collect the sample data
3. Use the sample data to estimate the unknown parameters in the
model.
4. Specify the probability distribution of ε and estimate any
unknown parameters in the distribution. Check the validity of
the assumptions made about the probability distribution.
5. Statistically check the usefulness of the model
6. If the model is useful, use the model for appropriate prediction
and estimation
Notation:
Recall that Data = model + error, or Y = f(x) + ε
μy (or μy|x) is the mean value of y for a particular value of x
ε is the deviation from that mean value at a value of x
In a simple linear regression model:
μy = f(x) = β1x + β0 (the mean value of y at a given value of x)
and
y = f(x) + ε = β1x + β0 + ε (the actual value of y for a given x)
In our example,
μbirthweight = β1age + β0
The actual birthweights are represented by
Birthweight = β1age + β0 + ε
The first step in determining whether there is a linear relationship
between the variables is to create a scatterplot of the data, with the
explanatory variable on the x-axis and the response variable on the
y-axis.
Does there appear to be a linear relationship?
The scatter diagram shows a positive linear relationship.
What does the scatterplot tell you about the strength and direction of
the linear relationship? Write your answer in the context of the
scenario.
The scatter diagram shows that there is a fairly strong positive
linear relationship between the two variables: as the mother’s
age increases, the child’s birthweight also increases.
That is, higher birthweights are associated with older mothers.
Fitting a Simple Linear Model
If the data appears to show a linear relationship, the method of least
squares finds the line that best fits the data. That is, it will provide the
best estimates for β0 and β1.
The points should be scattered about a straight line, with deviations
from the line determined by ε. The vertical distance between the
observed value of y and the predicted value of y at each value of x is
called the residual:
Residual = observed value - predicted value
We want the size of the residuals to be as small as possible; since
some residuals are positive and some are negative, we square the
residuals and minimize the sum of the squares.
SSE, the sum of squared errors, is a measure of how well the line
predicts the actual values.
The least squares line is the line for which SSE is minimized.
The equation of the least squares line is ŷ = β̂1x + β̂0
Some notation:
Consider the ith value in the dataset: yi = β0 + β1xi + εi
β0 and β1 are the true values for the population; these are
parameters
β̂0 and β̂1 are estimates of the coefficients based on the sample data;
these are statistics.
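The estimates β̂1 and β̂0 can be computed directly from the standard least squares formulas β̂1 = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and β̂0 = ȳ − β̂1x̄. A minimal sketch in Python, applied to the birthweight data (the course software itself is R and SPSS; this is only an illustration):

```python
# Least squares estimates for the birthweight example
# (illustrative Python; the course software is R/SPSS)
ages = [15, 17, 18, 15, 16, 19, 17, 16, 18, 19]
weights = [2289, 3393, 3271, 2648, 2897, 3327, 2970, 2535, 3138, 3573]

n = len(ages)
x_bar = sum(ages) / n
y_bar = sum(weights) / n

# Sxy = sum of (x - x_bar)(y - y_bar); Sxx = sum of (x - x_bar)^2
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, weights))
sxx = sum((x - x_bar) ** 2 for x in ages)

b1 = sxy / sxx            # slope estimate: 245.15
b0 = y_bar - b1 * x_bar   # intercept estimate: -1163.45
print(b1, b0)
```

These values reproduce the least squares line used in the slides that follow.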
In our example, the equation of the least squares line is
weight = 245.15 age – 1163.45
What does the value 245.15 represent, in context?
The child’s birthweight is expected to increase by 245.15 g for
each additional year in the age of the mother.
What does the value -1163.45 represent, in context?
If the mother’s age is 0 years, the child’s birthweight is expected
to be -1163.45 g. (This is far outside the range of the data, so the
intercept has no practical interpretation here.)
In our example, the equation of the least squares line is
weight = 245.15 age – 1163.45
What birthweight would you expect for the baby of a mother who is 16
years old?
weight = 245.15 age – 1163.45
= 245.15(16) – 1163.45
= 3922.4 – 1163.45
= 2758.95
What was the birthweight for the baby of a mother who was 16 years
old?
2897 g
What is the residual? 2897 – 2758.95 = 138.05 g
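The arithmetic above can be checked with a short script (an illustrative Python sketch; the function name is hypothetical):

```python
# Fitted value and residual at age 16, using the least squares line
# weight-hat = 245.15 * age - 1163.45 (illustrative sketch)
def predicted_weight(age):
    """Predicted birthweight in grams for a given maternal age."""
    return 245.15 * age - 1163.45

fitted = predicted_weight(16)   # 2758.95
residual = 2897 - fitted        # observed - predicted = 138.05
print(fitted, residual)
```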
In our example, the equation of the least squares line is
weight = 245.15 age – 1163.45
What birthweight would you expect for the baby of a mother who is 11
years old?
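Plugging age 11 into the fitted line is mechanical, but note that 11 years lies well outside the observed ages (15 to 19), so this would be an extrapolation and the result should not be trusted. A hypothetical Python sketch:

```python
# Extrapolating the least squares line weight-hat = 245.15*age - 1163.45
# to age 11 -- outside the observed range of 15 to 19, so the prediction
# is not necessarily valid (illustrative sketch only)
age = 11
predicted = 245.15 * age - 1163.45   # about 1533.2 g
print(predicted)
```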
Conditions for a Simple Linear Model
Linearity - the scatterplot shows a general linear pattern
Zero Mean - the distribution of the errors is centered at zero
Constant Variance - the variability of the errors is the same for
all values of the predictor variable
Independence - the errors are independent of each other
Conditions for Inference also include:
Random - the data was obtained through a random process
Normality - the distribution of the errors is approximately
Normal
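One quick numeric note on the Zero Mean condition: for a least squares fit, the residuals always average to (essentially) zero by construction, so what matters in practice is the scatter around that mean. A hypothetical Python sketch using the birthweight data:

```python
# Residuals from the fitted line weight-hat = 245.15*age - 1163.45.
# For a least squares fit, the residuals sum to zero (up to rounding).
ages = [15, 17, 18, 15, 16, 19, 17, 16, 18, 19]
weights = [2289, 3393, 3271, 2648, 2897, 3327, 2970, 2535, 3138, 3573]

residuals = [y - (245.15 * x - 1163.45) for x, y in zip(ages, weights)]
mean_residual = sum(residuals) / len(residuals)
print(mean_residual)   # essentially zero (floating point)
```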
More about Residuals:
A residual plot is a scatterplot of the regression residuals against the
explanatory variable. Residual plots help us assess the fit of a
regression line. Here are examples of residual plots:
This residual plot shows no
systematic pattern; it shows a
uniform scatter of the points
about the fitted line, and indicates
that the regression line fits the data
well.
A curved pattern shows that the
data is not linear, so a straight line
is not a good fit for the data.
This residual plot shows that there
is more spread for larger values of
the explanatory variable, indicating
that predictions will be less accurate
when x is large.
You should also note any values with large residuals. These points are
outliers in the vertical (y) direction because they lie far from the line that
describes the overall pattern.
The Simple Linear Regression Model
For a quantitative response variable Y and a single quantitative
explanatory variable X the simple linear regression model is
Y = β0 + β1X + ε
where ε follows a normal distribution, that is ε ~ N(0, σε) and the errors
are independent from one another.
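The standard deviation σε of the error term can be estimated from the residuals as s = sqrt(SSE / (n − 2)), a standard result for simple linear regression. An illustrative Python sketch using the birthweight data; as a cross-check, s / sqrt(Sxx) reproduces the slope standard error of 45.908 reported in the regression output:

```python
import math

# Estimate the error SD for the birthweight example:
# s = sqrt(SSE / (n - 2)), where SSE is the sum of squared residuals
ages = [15, 17, 18, 15, 16, 19, 17, 16, 18, 19]
weights = [2289, 3393, 3271, 2648, 2897, 3327, 2970, 2535, 3138, 3573]

residuals = [y - (245.15 * x - 1163.45) for x, y in zip(ages, weights)]
sse = sum(r ** 2 for r in residuals)
n = len(ages)
s = math.sqrt(sse / (n - 2))      # about 205.3 g

# The standard error of the slope is s / sqrt(Sxx)
x_bar = sum(ages) / n
sxx = sum((x - x_bar) ** 2 for x in ages)
se_slope = s / math.sqrt(sxx)     # about 45.9, matching the output table
print(s, se_slope)
```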
Assessing Conditions
To check the Linearity Condition, consider a scatterplot of the data to
see if the points suggest a linear relationship.
Assessing Conditions
Check the Constant Variance Condition with a plot of the residuals.
Graphs of the residuals can also help to determine whether the
conditions are met.
Coefficients (Dependent Variable: Birthweight)

                 Unstandardized          Standardized
Model            B          Std. Error   Beta     t        Sig.
1  (Constant)    -1163.450  783.138               -1.486   .176
   Age             245.150   45.908      .884      5.340   .001
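As a quick sanity check on the output, each t statistic is simply the coefficient estimate divided by its standard error (illustrative Python arithmetic):

```python
# t statistic = estimate / standard error, for each row of the table
t_intercept = -1163.450 / 783.138   # about -1.486
t_slope = 245.150 / 45.908          # about 5.340
print(round(t_intercept, 3), round(t_slope, 3))
```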
The table confirms that the slope of the regression line is 245.15 and
the y-intercept is -1163.45 for the birthweight example.
Now consider a different example, relating the mortality rate in a town
(in deaths per 100,000) to the hardness of its water (in ppm of calcium).
For these data, the slope of the regression line is -3.23 and the
y-intercept is the point (0, 1676.4).
And so the equation of the regression line is
Mortality = -3.23 calcium + 1676.4
Mortality = -3.23 calcium + 1676.4
In other words, if the calcium level increases by one ppm, the
mortality rate is expected to decrease by 3.23 deaths per 100,000,
on average.
The y-intercept tells us that if the calcium level is 0 ppm, the
mortality rate would be 1676 deaths per 100,000. However, in this
case, this would be an extrapolation.
To add the graph of the regression line to the scatterplot:
> plot(x,y)
> abline(name of model)
For our data, the commands are:
> plot(calcium, mortality)
> abline(model)
Making Predictions; Interpolation and Extrapolation
The linear model makes it possible to make reasonable predictions
about any mean response within the range of the explanatory
variable.
Statements about the mean at values of the explanatory variable
not in the data set but within the range of the observed values are
called interpolations.
Making predictions for values outside of the range of the data is
called extrapolation and is not necessarily valid.
To make a prediction:
First create a data structure called a dataframe that contains the
value(s) of the explanatory variable that you want to use in your
prediction; you may use any appropriate name:
>newdata=data.frame(predictor=value)
Then attach this new value to make it available to R:
>attach(newdata)
Now you can make your prediction. You may choose to include
these arguments:
- a confidence interval or a prediction interval (default = none)
- level of confidence (default is .95)
> predict(model, newdata, interval="confidence", level=.95)
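Under the hood, the confidence interval for the mean response at a value x0 is fit ± t* · s · sqrt(1/n + (x0 − x̄)²/Sxx). A hypothetical Python sketch using the earlier birthweight data (the critical value 2.306 for t with 8 degrees of freedom is hardcoded as an assumption, since Python's standard library has no t quantile function):

```python
import math

# 95% CI for the mean birthweight at age 16 (illustrative sketch)
ages = [15, 17, 18, 15, 16, 19, 17, 16, 18, 19]
weights = [2289, 3393, 3271, 2648, 2897, 3327, 2970, 2535, 3138, 3573]
n = len(ages)

residuals = [y - (245.15 * x - 1163.45) for x, y in zip(ages, weights)]
s = math.sqrt(sum(r ** 2 for r in residuals) / (n - 2))

x_bar = sum(ages) / n
sxx = sum((x - x_bar) ** 2 for x in ages)

x0 = 16
fit = 245.15 * x0 - 1163.45
t_crit = 2.306                       # t(0.975, df = 8), hardcoded assumption
margin = t_crit * s * math.sqrt(1 / n + (x0 - x_bar) ** 2 / sxx)
print(fit - margin, fit + margin)    # roughly (2576, 2942)
```

In practice R's predict() does this computation for you; the sketch only shows where the interval comes from.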
Example: Predict the mortality rate in a town where the hardness
level of the water is 105 ppm of calcium.
> newdata=data.frame(calcium=105)
> attach(newdata)
> predict(model, newdata, interval="confidence", level=.95)
       fit      lwr      upr
1 1337.616 1270.624 1404.608
The mortality rate would be about 1338 deaths per 100,000.
We have predicted a mortality rate of about 1338 deaths per
100,000 for a town with a calcium level of 105 ppm. However, there
is a town with this calcium level, and the mortality rate for this town
is 1247 deaths per 100,000.
A Residual is the difference between the observed value and the
predicted value of the response variable for a particular value of the
explanatory variable.
Residual = observed value – predicted value
And so the residual for 105 ppm of calcium is 1247 – 1338 = -91
The R commands to create a residual plot and show the line for a
zero residual are:
> plot(fitted(model), resid(model))
> abline(h=0)
Robustness of Least Squares Inference
What if the assumptions for this analysis are not met? What if the
scatterplot does not show a linear relationship between the
variables?
The United Nations Development Programme (UNDP) collects data
in the developing world to help countries solve global and national
development challenges. One summary measure used by the
agency is the Human Development Index (HDI) which attempts to
summarize in a single number the progress in health, education,
and economics of a country. In 2006 the HDI was as high as 0.965
for Norway and as low as 0.331 for Niger. The gross domestic
product per capita (GDPPC), by contrast, is often used to
summarize the overall economic strength of a country.
Is there a relationship between the HDI and the GDPPC?
Here is a scatterplot of GDPPC against HDI:
Is it appropriate to fit a linear model to this data? Why or why not?
Here are histograms of the GDPPC values and the log of those values.
How would you describe these distributions?
How would you describe the relationship between the HDI and the
log(GDPPC)?
> cor(HDI, log(GDPPC))
[1] 0.9207729
How would you describe the relationship between the HDI and the
log(GDPPC)?
> UN = lm(HDI ~ log(GDPPC))
> UN

Call:
lm(formula = HDI ~ log(GDPPC))

Coefficients:
(Intercept)   log(GDPPC)
    -0.5177       0.1422
The regression equation is HDI = 0.1422 log(GDPPC) – 0.5177
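Applying the fitted equation (note that R's log() is the natural logarithm), a hypothetical Python sketch; the GDPPC value used here is invented purely for illustration:

```python
import math

# HDI predicted from the fitted model HDI = 0.1422*log(GDPPC) - 0.5177
# (log is the natural log, as in R; illustrative sketch only)
def predicted_hdi(gdppc):
    return 0.1422 * math.log(gdppc) - 0.5177

# e.g. a country with GDP per capita of 20,000 (hypothetical value)
print(round(predicted_hdi(20000), 3))   # about 0.89
```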