Chapter 23
Poisson Regression

Contents
23.1 Introduction
23.2 Experimental design
23.3 Data structure
23.4 Single continuous X variable
23.5 Single continuous X variable - dealing with overdispersion
23.6 Single Continuous X variable with an OFFSET
23.7 ANCOVA models
23.8 Categorical X variables - a designed experiment
23.9 Log-linear models for multi-dimensional contingency tables
23.10 Variable selection methods
23.11 Summary

The suggested citation for this chapter of notes is:

Schwarz, C. J. (2015). Poisson Regression. In Course Notes for Beginning and Intermediate Statistics. Available at http://www.stat.sfu.ca/~cschwarz/CourseNotes. Retrieved 2015-08-20.

23.1 Introduction

In past chapters, multiple-regression methods were used to predict a continuous Y variable given a set of predictors, and logistic-regression methods were used to predict a dichotomous categorical variable given a set of predictors. In this chapter, we will explore the use of Poisson-regression methods, which are typically used to predict counts of (rare) events given a set of predictors.

Just as multiple regression implicitly assumed that the Y variable had a normal distribution, and logistic regression assumed that the choice of categories in Y was based on a binomial distribution, Poisson regression assumes that the observed counts are generated from a Poisson distribution.

The Poisson distribution is often used to model count data when the events being counted are somewhat rare, e.g. cancer cases, the number of accidents, the number of satellite males around a female bird, etc. It is characterized by the expected number of events µ with probability mass function:

P(Y = y | µ) = e^(−µ) µ^y / y!

where y! = y(y − 1)(y − 2) . . . (2)(1), and y ≥ 0. The probability mass function is available in tabular form, or can be computed by many statistical packages. While the values of Y are restricted to being non-negative integers, it is not necessary for µ to be an integer.

In the following graph, 1000 observations were each generated from a Poisson distribution with differing means. For very small values of µ, virtually all the counts are zero, with only a few counts that are positive. As µ increases, the shape of the distribution looks more and more like a normal distribution – indeed, for large µ, a normal distribution can be used as an approximation to the distribution of Y.

Sometimes µ is further parameterized by a rate parameter and a group size, i.e. µ = Nλ where λ is the rate per unit and N is the group size. For example, the number of cancers in a group of 100,000 people could be modeled using λ as the rate per 1000 people and N = 100.
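If you want to explore the shape of the Poisson distribution yourself, a small simulation along the lines of the graph described above can be run in any package. Below is a minimal sketch in Python (not part of the original JMP-based notes); the means chosen are arbitrary illustrations.

```python
# A minimal sketch: simulate 1000 Poisson counts for several means and
# evaluate the probability mass function P(Y = y | mu) = exp(-mu) mu^y / y!
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(2015)

for mu in [0.5, 2, 5, 20]:
    y = rng.poisson(mu, size=1000)                 # 1000 simulated counts
    print(f"mu={mu:>4}: P(Y=0)={poisson.pmf(0, mu):.3f}, "
          f"sample mean={y.mean():.2f}, sample var={y.var(ddof=1):.2f}")
```

For small µ the simulated counts are mostly zero; for large µ the sample histogram looks close to normal, and the sample mean and variance track each other, as discussed next.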
Two important properties of the Poisson distribution are:

E[Y] = µ
V[Y] = µ

Unlike the normal distribution, which has separate parameters for the mean and variance, the Poisson distribution has variance equal to the mean. This means that once you estimate the mean, you have also estimated the variance, and so it is not necessary to have replicate counts to estimate the sample variance from data. As will be seen later, this can be quite limiting because for many populations the data are over-dispersed, i.e. the variance is greater than you would expect from a simple Poisson distribution.

Another important property is that the Poisson distribution is additive. If Y1 is Poisson(µ1), and Y2 is Poisson(µ2), then Y = Y1 + Y2 is also Poisson(µ = µ1 + µ2).

Lastly, the Poisson distribution is a limiting distribution of a binomial distribution as n becomes large and p becomes very small.

Poisson regression is another example of a Generalized Linear Model (GLIM).1 As in all GLIMs, the modeling process is a three-step affair:

Yi is assumed Poisson(µi)
φi = log(µi)
φi = β0 + β1 Xi1 + β2 Xi2 + . . .

Here the link function is the natural logarithm, log. In many cases, the mean changes in a multiplicative fashion. For example, if population size doubled, then the expected number of cancer cases should also double. As a population ages, the rate of cancer increases linearly on a log-scale. Additionally, by modeling log(µi), it is impossible to get negative estimates of the mean.

The linear part of the GLIM can consist of continuous X or categorical X variables, or mixtures of both types of predictors. Categorical variables will be converted to indicator variables in exactly the same way as in multiple- and logistic-regression.

Unlike multiple-regression, there are no closed-form solutions for the estimates of the parameters. Standard maximum likelihood estimation (MLE) methods are used.2 MLEs are guaranteed to be the "best" estimators (smallest standard errors) as the sample size increases, and seem to work well even if the sample sizes are not large. Standard methods are used to estimate the standard errors of the estimates. Model comparisons are done using likelihood-ratio tests whose test statistics follow a chi-square distribution, which is used to give a p-value that is interpreted in the standard fashion. Predictions are done in the usual fashion – these initially appear on the log-scale and must be anti-logged to provide estimates on the ordinary scale.

1 Logistic regression is another GLIM.
2 A discussion of the theory of MLE is beyond the scope of this course, but is covered in Stat-330 and Stat-402.

23.2 Experimental design

In this chapter, we will again assume that the data are collected under a completely randomized design. In some of the examples that follow, blocked designs will be analyzed, but we will not explore how to analyze split-plot or repeated-measures designs, or designs with pseudo-replication. The analysis of such designs in a generalized linear models framework is possible – please consult with a statistician if you have a complex experimental design.

23.3 Data structure

The data structure is straightforward. Columns represent variables and rows represent observations. The response variable, Y, will be a count of the number of events and will be set to a continuous scale. The predictor variables, X, can be either continuous or categorical – in the latter case, indicator variables will be created.
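To make the three-step GLIM concrete, here is a hedged sketch in Python's statsmodels (the course itself uses JMP) of a Poisson regression with one continuous and one categorical predictor. The data frame and column names (count, x, habitat) are hypothetical toy values; the formula interface expands the categorical predictor into indicator variables automatically.

```python
# Three-step GLIM sketch: Y_i ~ Poisson(mu_i); log(mu_i) = b0 + b1*x + habitat effects.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "count":   [3, 0, 7, 2, 5, 1],                  # response: a count per row (observation)
    "x":       [1.2, 0.4, 3.1, 0.9, 2.2, 0.5],      # continuous predictor
    "habitat": ["a", "b", "a", "b", "a", "b"],      # categorical predictor -> indicator variables
})

fit = smf.glm("count ~ x + C(habitat)", data=df,
              family=sm.families.Poisson()).fit()   # the log link is the Poisson default
print(fit.summary())
```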
As usual, the coding that a package uses for indicator variables is important if you want to interpret the estimates of the effect of the indicator variable directly. Consult the documentation for the package for details.

23.4 Single continuous X variable

The JMP file salamanders-burn.jmp, available in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms, contains data on the number of salamanders in a fixed-size quadrat at various locations in a large forest. The locations of the quadrats were chosen to represent a range of years since a forest fire burned the understory.

A simple plot of the data shows an increasing relationship between the number of salamanders and the time since the forest understory burned.

Why can't a simple regression analysis using standard normal theory be used to fit the curve?

First, the assumption of normality is suspect. The counts of the number of salamanders are discrete, with most under 10. It is impossible to get a negative number of salamanders, so the bottom left part of the graph would require the normal distribution to be truncated at Y = 0.

Second, it appears that the variance of the counts at any particular age increases with age since burned. This violates the assumption of equal variance for all X values made for standard regression models.

Third, the fitted line from ordinary regression could go negative. It is impossible to have a negative number of salamanders.

It seems reasonable that a Poisson distribution could be used to model the number of salamanders. They are relatively rare and seem to forage independently of each other. These conditions are the underpinnings of a Poisson distribution.

The process of fitting the model and interpreting the output is analogous to that used in logistic regression. The basic model is then:

Yi ∼ Poisson(µi)
θi = log(µi)
θi = β0 + β1 Yearsi

As in the logistic model, the distribution of the data about the mean (line 1) has a link function (line 2) between the mean for each Y and the linear structural part of the model (line 3). In logistic regression, the logit link was used to ensure that all values of p were between 0 and 1. In Poisson regression, the log (natural logarithm) is traditionally used to ensure that the mean is always positive.

The model must be fit using maximum likelihood methods, just like in logistic regression. This model is fit in JMP using the Analyze->Fit Model platform. Be sure to specify the proper distribution and link function.

Most of the output parallels that seen in logistic regression. At the top of the output is a summary of the variable being analyzed, the distribution for the raw data, the link used, and the total number of observations (rows in the dataset).

The Whole Model Test is analogous to that in multiple-regression – is there evidence that the set of predictors (in this case there is only one predictor) has any predictive ability over that seen by random chance? The test statistic is computed using a likelihood-ratio test comparing this model to a model with only the intercept. The p-value is very small, indicating that the model has some predictive ability. [Because there is only 1 predictor, this test is equivalent to the Effect Test discussed below.]
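For readers working outside JMP, the same fit can be sketched in Python's statsmodels. The column names below are hypothetical, and the data are simulated stand-ins generated from the fitted equation reported below (log µ = 0.59 + .045 Years), not the real quadrat counts.

```python
# Hedged sketch of the salamander fit: Poisson GLM of count on years since burn.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
years = rng.uniform(1, 60, size=80)                    # years since the understory burned (toy values)
counts = rng.poisson(np.exp(0.59 + 0.045 * years))     # simulated stand-in salamander counts
sal = pd.DataFrame({"years": years, "count": counts})

fit = smf.glm("count ~ years", data=sal,
              family=sm.families.Poisson()).fit()      # Poisson distribution, log link
print(fit.summary())
print(fit.predict(pd.DataFrame({"years": [12]})))      # estimated mean count at 12 years
```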
The goodness-of-fit statistic compares the model with the intercept and the single predictor to a model where every observation is predicted individually. If the model fits well, the chi-square test statistic should be approximately equal to its degrees of freedom, and the p-value should be LARGE, i.e. much larger than .05.3 There is no evidence of a problem in the fit. Later in this section, we will examine how to adjust for slight lack of fit.

The Effect Tests examine if each predictor (or in the case of a categorical variable, the entire set of indicator variables) makes a statistically significant marginal contribution to the fit. As in a multiple-regression model, these are MARGINAL contributions, i.e. assuming that all other variables remain in the model and are fixed at their current values. There is only one predictor, and there is strong evidence against the hypothesis of no marginal contribution.

Finally, the Parameter Estimates section reports the estimated β's. So our fitted model is:

Yi ∼ Poisson(µi)
θi = log(µi)
θi = 0.59 + .045 Yearsi

Each line also tests if the corresponding population coefficient is zero. Because each of the X variables in the model is a single variable (i.e. not a set of categories), the results of the parameter estimates tests match the effect tests.

3 Remember that in goodness-of-fit tests, you DON'T want to find evidence against the null hypothesis.

We can obtain predictions by following the drop-down menu. For example, consider the first row of the data. At 12 years since the last burn, we estimate the mean response by starting at the bottom of the model and working upwards:

θ1 = 0.59 + .045(12) = 1.12
µ1 = exp(1.12) = 3.08

which is the predicted value in the table. As in ordinary normal-theory regression, confidence limits for the mean response and for an individual response may be found. The table shows the confidence interval for the mean response.

Finally, a residual plot may also be constructed. There is no evidence of a lack-of-fit.

23.5 Single continuous X variable - dealing with overdispersion

One of the weaknesses of Poisson regression is the very restrictive assumption that the variance of a Poisson distribution is equal to its mean. In some cases, data are over-dispersed, i.e. the variance is greater than predicted by a simple Poisson distribution. In this section, we will illustrate how to detect overdispersion and how to adjust the analysis to account for overdispersion.

In the section on Logistic Regression, a dataset was examined on nesting horseshoe crabs4 that is analyzed in Agresti's book.5 The design of the study is given in Brockmann H.J. (1996). Satellite male groups in horseshoe crabs, Limulus polyphemus. Ethology, 102, 1-21. Again it is important to check that the design is a completely randomized design or a simple random sample. As in regression models, you do have some flexibility in the choice of the X settings, but for a particular weight and color, the data must be selected at random from the relevant population.

4 See http://en.wikipedia.org/wiki/Horseshoe_crab.
5 These are available from Agresti's web site at http://www.stat.ufl.edu/~aa/cda/sas/sas.html.

Each female horseshoe crab had a male resident in her nest. The study investigated other factors affecting whether the female had any other males, called satellites, residing nearby. These other factors include:
• crab color, where 2=light medium, 3=medium, 4=dark medium, 5=dark;
• spine condition, where 1=both good, 2=one worn or broken, 3=both worn or broken;
• weight;
• carapace width.

In the section on Logistic Regression, a derived variable on the presence or absence of satellite males was examined. In this section, we will examine the actual number of satellite males.

A JMP dataset crabsatellites.jmp is available from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. Note that the color and spine condition variables should be declared with an ordinal scale despite having numerical codes. In this analysis we will use the actual number of satellite males.

As noted in the section on Logistic Regression, a preliminary scatter plot of the variables shows some interesting features. There is a very high positive relationship between carapace width and weight, but there are a few anomalous crabs that should be investigated further, as shown in a magnified plot. There are three points with weights in the 1200-1300 g range whose carapace widths suggest that the weights should be in the 2200-2300 g range, i.e. a typographical error in the first digit. There is a single crab whose weight suggests a width of 24 cm rather than 21 cm – perhaps a typo in the last digit. Finally, there is one crab which is extremely large compared to the rest of the group. In the analysis that follows, I've excluded these five data values.

To begin with, fit a model that attempts to predict the mean number of satellite crabs as a function of the weight of the female crab, i.e.

Yi distributed Poisson(µi)
λi = log(µi)
λi = β0 + β1 Weighti

The Generalized Linear Model platform of JMP is used. This gives selected output.

There are two parts of the output which show that the fit is not very satisfactory. First, while the studentized residual plot does not show any structural defects (the residuals are scattered around zero)6, it does show substantial numbers of points outside of the (−2, 2) range. This suggests that the data are too variable relative to the Poisson assumption. Second, the goodness-of-fit statistic has a very small p-value, indicating that the data are not well fit by the model. This is an example of overdispersion.

To see this overdispersion, divide the weight classes into categories, e.g. 0 − 2500 g, 2500 − 3000 g, etc. [This has already been done in the dataset.]7 Now find the mean and variance of the number of satellite males for each weight class using the Tables->Summary platform. If the Poisson assumption were true, then the variance of the number of satellite males should be roughly equal to the mean in each class.

6 The "lines" in the plot are artifacts of the discrete nature of the response. See the chapter on residual plots for more details.
7 The choice of 4 weight classes is somewhat arbitrary. I would usually try to subdivide the data into between 4 and 10 classes, ensuring that at least 20-30 observations are in each class.
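The same mean-versus-variance check can be scripted. The sketch below uses toy over-dispersed counts (not the real crab data) and hypothetical column names, but the group-by computation is the same as the Tables->Summary step.

```python
# Compare the mean and variance of the counts within each weight class.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
crabs = pd.DataFrame({
    "weight_class": np.repeat(["<2500", "2500-3000", "3000-3500", ">3500"], 40),
    "satellites":   rng.negative_binomial(n=2, p=0.4, size=160),   # deliberately overdispersed toy counts
})
check = crabs.groupby("weight_class")["satellites"].agg(["count", "mean", "var"])
print(check)   # under a Poisson model the var column should track the mean column
```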
In fact, the variance in the number of satellite males appears to be roughly 3× the mean.

With generalized linear models, there are two ways to adjust for over-dispersion.

A different distribution can be used that is more flexible in the mean-to-variance ratio. A common distribution that is used in these cases is the negative binomial distribution. In more advanced classes, you will learn that the negative binomial distribution can arise from a Poisson distribution with extra variation in the mean rates. JMP does not allow the fitting of a negative binomial distribution, but this option is available in SAS.

An "ad hoc" method, that nevertheless has theoretical justification, is to allow some flexibility in the variance. For example, rather than restricting V[Y] = E[Y] = µ, perhaps V[Y] = cµ where c is called the over-dispersion factor. Note that if this formulation is used, the data are no longer distributed as a Poisson distribution; in fact, there is NO actual probability function that has this property. Nevertheless, this quasi-distribution still has nice properties, and the over-dispersion factor can be estimated using quasi-likelihood methods that are analogous to regular likelihood methods. The end result is that the over-dispersion factor is used to adjust the se and the test-statistics. The adjusted se are obtained by multiplying the se from the Poisson model by √ĉ. The adjusted chi-square test statistics are found by dividing the test statistics from the Poisson model by ĉ, and the p-values are adjusted by looking up the adjusted test-statistics in the appropriate table.

How is the over-dispersion factor c estimated? There are two methods, both of which are asymptotically equivalent. These involve taking a goodness-of-fit statistic and dividing by its degrees of freedom:

ĉ = goodness-of-fit statistic / df

Usually, ĉ's of less than 10 (corresponding to a potential inflation in the se by a factor of about 3) are acceptable – if the inflation factor is more than about 10, the lack-of-fit is so large that alternate methods should be used.

In JMP, the adjustment for over-dispersion is requested in the Analyze->Fit Model dialogue box. The revised output shows that the overdispersion factor has been estimated as

ĉ = chi-square / df = 519.7857 / 166 = 3.13

This is very close to the "guess" that we made based on looking at the variance-to-mean ratio among weight classes.

The estimated intercept and slope are unchanged and their interpretation is as before. For example, the estimated slope of .000668 is the estimated increase in the log number of male satellite crabs when the female crab's weight increases by 1 g. A 1000 g increase in body weight corresponds to a 1000 × .000668 = .668 increase in the log(number of satellite males), which corresponds to an increase by a factor of e^.668 = 1.95, i.e. the mean number of male satellite crabs almost doubles. The estimated se have been "inflated" by √ĉ = √3.13 = 1.77, and the confidence intervals for the slope and intercept are now wider. The chi-square test statistics have been "deflated" by ĉ and the p-values have been adjusted accordingly.

Finally, the residual plot has been rescaled by the factor √ĉ and now most residuals lie between −2 and 2.
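Outside JMP, the same quasi-likelihood adjustment is available in other packages. A hedged sketch in Python's statsmodels is shown below, using simulated over-dispersed toy data (not the crab data): the scale="X2" option estimates ĉ as the Pearson chi-square divided by its degrees of freedom, leaves the coefficients unchanged, and inflates the standard errors by √ĉ.

```python
# Quasi-Poisson style adjustment for over-dispersion in statsmodels.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
toy = pd.DataFrame({"weight": rng.uniform(1500, 4000, 150)})
toy["satellites"] = rng.negative_binomial(2, 0.4, 150)       # deliberately overdispersed toy counts

poisson_fit = smf.glm("satellites ~ weight", data=toy,
                      family=sm.families.Poisson()).fit()             # assumes V[Y] = mu
quasi_fit   = smf.glm("satellites ~ weight", data=toy,
                      family=sm.families.Poisson()).fit(scale="X2")   # allows V[Y] = c * mu

c_hat = poisson_fit.pearson_chi2 / poisson_fit.df_resid
print(c_hat)                # estimated over-dispersion factor
print(poisson_fit.bse)      # naive standard errors
print(quasi_fit.bse)        # inflated by roughly sqrt(c_hat); coefficients are unchanged
```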
Note that the pattern of the residual plot doesn't change; all that the over-dispersion adjustment does is change the residual variance so that the standardization brings the residuals closer to 0.

Predictions of the mean response at levels of X are obtained in the usual fashion, giving (partial output) predicted means whose se have also been adjusted for overdispersion, as have the confidence intervals for the mean number of male satellite crabs at each weight value. However, notice that the menu item for a prediction interval for the INDIVIDUAL response is "grayed out" and it is now impossible to obtain prediction intervals for the ACTUAL number of events. By using the overdispersion factor, you are no longer assuming that the counts are distributed as a Poisson distribution – in fact, there is NO REAL DISTRIBUTION that has the mean-to-variance ratio that is implicitly assumed when using the overdispersion factor. Without an actual distribution, it is impossible to make predictions for individual events.

We save the predicted values to the dataset and plot the final results on both the ordinary scale and on the log-scale (the scale where the model is "linear").

23.6 Single Continuous X variable with an OFFSET

In the previous examples, the sampling units (where the counts were obtained) were all the same size (e.g. the number of satellite males around a single female). In some cases, the sampling units are of different sizes. For example, if the number of weeds is counted in a quadrat plot, then hopefully the size of the plot is constant. However, it is conceivable that the size of the plot varies because different people collected different parts of the data. Or if the number of events is counted in a time interval (e.g. the number of fish captured in a fishing trip), the time intervals could be of different lengths.

Often these types of data are pre-standardized, i.e. converted to a per m2 or per hour basis, and then an analysis is attempted on this standardized variable. However, standardization destroys the Poisson shape of the data and turns out to be unnecessary if the size of the sampling unit is also collected.

The incidence of non-melanoma skin cancer among women in the early 1970's in Minneapolis-St Paul, Minnesota, and Dallas-Fort Worth, Texas is summarized below:

City  Age Class  Age Mid  Count  Pop Size
msp   15-24      20       1      172,675
msp   25-34      30       16     123,065
msp   35-44      40       30     96,216
msp   45-54      50       71     92,051
msp   55-64      60       102    72,159
msp   65-74      70       130    54,722
msp   75-84      80       133    32,185
msp   85+        90       40     8,328
dfw   15-24      20       4      181,343
dfw   25-34      30       38     146,207
dfw   35-44      40       119    121,374
dfw   45-54      50       221    111,353
dfw   55-64      60       259    83,004
dfw   65-74      70       310    55,932
dfw   75-84      80       226    29,007
dfw   85+        90       65     7,538

We will first examine the relationship of cancer incidence to age by using the age midpoint as our continuous X variable and using only the Minneapolis data (for now). The data set is available in the JMP data file skincancer.jmp from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

Is there a relationship between the age of a cohort and the cancer incidence rate?
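For the sketches that follow, the table above can be typed directly into a small data frame. The column names below are my own choices; the JMP file skincancer.jmp remains the authoritative copy of the data.

```python
# The skin cancer table entered directly, plus the Minneapolis subset and the
# standardized (per person) incidence rate used in the plots described below.
import pandas as pd

skin = pd.DataFrame({
    "city":    ["msp"]*8 + ["dfw"]*8,
    "age_mid": [20, 30, 40, 50, 60, 70, 80, 90]*2,
    "count":   [1, 16, 30, 71, 102, 130, 133, 40,
                4, 38, 119, 221, 259, 310, 226, 65],
    "popsize": [172675, 123065, 96216, 92051, 72159, 54722, 32185, 8328,
                181343, 146207, 121374, 111353, 83004, 55932, 29007, 7538],
})
msp = skin[skin["city"] == "msp"].copy()
msp["rate"] = msp["count"] / msp["popsize"]     # standardized incidence per person
print(msp)
```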
Notice that a comparison of the raw counts is not very sensible because of the different sizes of the age cohorts. Most people would first STANDARDIZE the incidence rate, e.g. find the incidence per person by dividing the number of cancers by the number of people in each cohort.

A plot of the standardized incidence rate by the mid-age of each cohort shows a curved relationship between the incidence rate and the mid-point of the age-cohort. This suggests a theoretical model of the form:

Incidence = C e^(age)

i.e. an exponential increase in the cancer rates with age. This suggests that a log-transform be applied to BOTH sides, but a plot of the logarithm of the incidence rate against log(age midpoint) is still not linear, with a dip for the youngest cohorts. There appears to be a strong relationship between the log(cancer rate) and log(age) that may not be linear, but a quadratic looks as if it could fit quite nicely, i.e. a model of the form

log(incidence) = β0 + β1 log(age) + β2 log(age)^2 + residual

Is it possible to include the population size directly? Expand the above model:

log(incidence) = β0 + β1 log(age) + β2 log(age)^2 + residual
log(count / pop size) = β0 + β1 log(age) + β2 log(age)^2 + residual
log(count) − log(pop size) = β0 + β1 log(age) + β2 log(age)^2 + residual
log(count) = log(pop size) + β0 + β1 log(age) + β2 log(age)^2 + residual

Notice that log(pop size) has a known coefficient of 1 associated with it, i.e. there is NO β coefficient associated with log(pop size). Also notice that log(POP SIZE) is known in advance and is NOT a parameter to be estimated. Variables such as population size are often called offset variables, and notice that most packages expect to see the offset variable pre-transformed depending upon the link function used. In this case, the log link was used, so the offset is log(POP SIZE_age), as you will see in a minute.

Our GLIM model will then be:

Y_age distributed Poisson(µ_age)
φ_age = log(µ_age) = log(POP SIZE_age) + log(λ_age)
log(λ_age) = β0 + β1 log(AGE) + β2 log(AGE)^2

This can be rewritten slightly as:

log(µ_age) = log(POP SIZE_age) + β0 + β1 log(AGE) + β2 log(AGE)^2

or

log(λ_age) = log(µ_age) − log(POP SIZE_age) = β0 + β1 log(AGE) + β2 log(AGE)^2

So the modeling can be done in terms of estimating the effect of log(age) upon the incidence rate, rather than the raw counts, as long as the offset variable log(POP SIZE_age) is known.

To perform a Poisson regression, first create the offset variable log(POP SIZE_age) using the formula editor of JMP. The Analyze->Fit Model platform launches the analysis. Note that the raw count is the Y variable, and that the offset variable is specified separately from the X variables.

The goodness-of-fit statistic in the output indicates no evidence of lack-of-fit, i.e. no need to adjust for over-dispersion. Based on the results of the Effect Test for the quadratic term, it appears that a linear fit may actually be sufficient, as the p-value for the quadratic term is almost 10%. The reason for this apparent non-need for the quadratic term is that the smaller age-cohorts have very few counts, and so the actual incidence rate is very imprecisely estimated.
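Continuing with the msp data frame from the earlier sketch, the offset model described above might be fit as follows (a sketch, not JMP output); the offset enters with a fixed coefficient of 1 and is not estimated.

```python
# Poisson regression of the counts on log(age) and log(age)^2 with log(popsize) as the offset.
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

quad_fit = smf.glm("count ~ np.log(age_mid) + I(np.log(age_mid)**2)",
                   data=msp,
                   family=sm.families.Poisson(),
                   offset=np.log(msp["popsize"])).fit()
print(quad_fit.summary())    # the quadratic term is marginal, as discussed in the text
```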
Finally, the Parameter Estimates section reports the estimated β's (remember these are on the log-scale). Each line also tests if the corresponding population coefficient is zero. Because each of the X variables in the model is a single variable (i.e. not a set of categories), the results of the parameter estimates tests match the effect tests.

Based on the output so far, it appears that we can drop the quadratic term. This term was dropped, and the model refit. The final (estimated) model is

log(λ_age) = −21.32 + 3.60 log(age)

The predicted log(λ) for age 40 is found as:

log(λ_40) = −21.32 + 3.60 log(40) = −8.04

This incidence rate is on the log-scale, so the predicted incidence rate is found by taking the anti-log, or e^(−8.04) = .000322, or .322/thousand people, or 322/million people.

In order to make predictions about the expected number of cancers in each age cohort that would be seen under this model, you need to add back the log(POP SIZE) for the appropriate age class:

log(µ_40) = log(λ_40) + log(POP SIZE_40) = −8.04 + 11.47 = 3.42

Finally, the predicted number of cases is simply the anti-log of this value:

Ŷ_40 = e^(log µ_40) = e^3.42 = 30.96

Of course, this can be done automatically by the platform. This also allows you to save the confidence limits for the average number of skin cancers expected for this age class (the mean confidence bounds, assuming the same population size) and the confidence limits for the actual number of cases (the individual confidence bounds). In this case, the expected number of skin cancer cases for the 35-44 age group is 30.69, with a 95% confidence interval for the mean number of cases ranging from (26.0 → 36.8). The confidence bound for the actual number of cases (assuming the model is correct) is somewhere between 19 and 43 cases.

By adding new data lines to the data table (before the model fit) with the Y variable missing, but the age and offset variables present, you can make forecasts for any set of new X values.

The residual plot isn't too bad – the large negative residual for the first age class (where the predicted number of skin cancers is near 0) is a bit worrisome; I suspect this is where the quadratic curve may provide a better fit.

A plot of actual vs. predicted values can be obtained directly, or by saving the predicted values to the data sheet and using the Analyze->Fit Y-by-X platform with Fit Special to add the reference line. These plots show excellent agreement with the data.

Finally, it is nice to construct an overlay plot of the empirical log(rates) (the first plot constructed) with the estimated log(rate) and confidence bounds as a function of log(age). Create the predicted log(rate) using the formula editor from the predicted skin cancer numbers by subtracting the log(POP SIZE) (why?). Repeat the same formula for the lower and upper bounds of the 95% confidence interval for the mean number of cases.
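The back-transformation just described (predicted counts and their confidence bounds divided by the population size, i.e. subtracting log(POP SIZE) on the log scale) might be scripted as in the sketch below, continuing with the msp data frame and dropping the quadratic term as above.

```python
# Convert predicted mean counts and their confidence bounds to the log(rate) scale.
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

lin_fit = smf.glm("count ~ np.log(age_mid)", data=msp,
                  family=sm.families.Poisson(),
                  offset=np.log(msp["popsize"])).fit()

pred = lin_fit.get_prediction(msp, offset=np.log(msp["popsize"])).summary_frame()
msp["log_rate_hat"]   = np.log(pred["mean"]).values          - np.log(msp["popsize"])
msp["log_rate_lower"] = np.log(pred["mean_ci_lower"]).values - np.log(msp["popsize"])
msp["log_rate_upper"] = np.log(pred["mean_ci_upper"]).values - np.log(msp["popsize"])
print(msp[["age_mid", "log_rate_hat", "log_rate_lower", "log_rate_upper"]])
```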
Finally, use Graph → Overlay Plot to plot the empirical estimates, the predicted values of λ, and the 95% confidence interval for λ on the same plot, and fiddle8 with the plot to join up the predictions and confidence bounds but leave the actual empirical points as is to give the final plot.

8 I had to turn on the "connect through missing" option under the red-triangle.

Remember that the point with the smallest log(rate) is based on a single skin cancer case and is not very reliable. That is why the quadratic fit was likely not selected.

23.7 ANCOVA models

Just like in regular multiple-regression, it is possible to mix continuous and categorical variables and test for parallelism of the effects. Of course, this parallelism is assessed on the link scale (in most cases for Poisson data, on the log scale). There is nothing new compared to what was seen with ordinary regression and logistic regression. The three appropriate models are:

log(λ) = X
log(λ) = X Cat
log(λ) = X Cat X*Cat

where X is the continuous predictor and Cat is the categorical predictor. The first model assumes a common line for all categories of the Cat variable. The second model assumes parallel slopes, but differing intercepts. The third model assumes separate lines for each category. Fitting would start with the most complex model (the third model) and test if there is evidence of non-parallelism. If none were found, the second model would be examined, and a test would be made for common intercepts. Finally, the simplest model may be an adequate fit.

Let us return to the skin cancer data examined earlier in this chapter. It is of interest to see if there is a consistent difference in skin cancer rates between the two cities. Presumably, Dallas, which receives more intense sun, would have a higher skin cancer rate.

The data are available in the skincancer.jmp data set in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms. Use all of the data. As before, the log(population size) will be the offset variable.

A preliminary data plot of the empirical cancer rate for the two cities shows roughly parallel responses, but the curvature is much more pronounced in Dallas. Perhaps a quadratic model should be fit first, with separate response curves for both cities. In shorthand model notation, this is:

log(λ) = City log(Age) log(Age)^2 City*log(Age) City*log(Age)^2

where City is the effect of the two cities, log(Age) is the continuous X variable, and the interaction terms represent the non-parallelism of the responses. As before, use the Generalized Linear Model option of the Analyze->Fit Model platform and don't forget to specify the log(popsize) as the offset variable.

The Whole Model Test shows evidence that the model has predictive ability. The Goodness-of-fit Test shows that this model is a reasonable fit (p-values around .30). The Effect Tests show that perhaps both of the interaction terms can be dropped, but some care must be taken, as these are marginal tests and cannot simply be combined.
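Continuing with the two-city skin data frame from the earlier sketch, the full non-parallel model above (City, log(Age), log(Age)^2 and their interactions, with log(popsize) as the offset) might be sketched as follows; this is an illustration, not the JMP output.

```python
# Full ANCOVA-style Poisson model: separate quadratic curves on the log(rate) scale for each city.
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

full = smf.glm("count ~ C(city) * (np.log(age_mid) + I(np.log(age_mid)**2))",
               data=skin,
               family=sm.families.Poisson(),
               offset=np.log(skin["popsize"])).fit()
print(full.summary())    # marginal tests for each interaction term appear in the summary
```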
A "Chunk Test" similar to that seen in logistic regression can be done to see if both interaction terms can be dropped simultaneously. The p-value is just above α = .05, so I would be a little hesitant to drop both interaction terms. On the other hand, some of the larger age classes have such large sample sizes and large count values that very minor differences in fit can likely be detected.

The simpler model with two parallel quadratic curves was then fit. This simpler model also has no strong evidence of lack-of-fit. Now, however, the quadratic term cannot be dropped.

The parameter estimates must be interpreted carefully for categorical data. Every package codes indicator variables in different ways, and so the interpretation of the estimates associated with the indicator variables differs among packages. JMP codes indicator variables so that estimates are the difference in response between the specified level and the AVERAGE over all levels. So in this case, the estimate associated with City[dfw] = .401 represents 1/2 the distance between the two parallel curves. Consequently, the difference in log(λ) between Minneapolis and Dallas is 2 × .401 = .802 (SE 2 × .026 = .052). This is a consistent difference for all age groups.

This can also be estimated, without having to worry too much about the coding details, by doing a contrast between the estimates for the city effects. This gives the same results as above.

This is a difference on the log-scale. As seen in earlier chapters, this can be converted to an estimate of the ratio of incidence rates by taking anti-logs. In this case, Dallas is estimated to have e^.802 = 2.22 TIMES the skin cancer rate of Minneapolis. This is consistent with what is seen in the raw data. The SE of this ratio is found using an application of the delta method.9 The delta-method indicates that the SE of an exponentiated estimate is found as

SE(e^θ̂) = SE(θ̂) e^θ̂

In this case SE(ratio) = .052 × 2.22 = .11.

9 A form of a Taylor Series Expansion. Consult many books on statistics for details.

Confidence bounds are found by finding the usual confidence bounds on the log-scale and then taking anti-logs of the end points. In this case, the 95% confidence interval for the difference in log(λ) is (.802 − 2(.052) → .802 + 2(.052)) or (.698 → .906). Taking anti-logs gives a 95% confidence interval for the ratio of skin cancer rates of (2.01 → 2.47).

The residual plot (not shown) looks reasonable.

23.8 Categorical X variables - a designed experiment

Just like ANOVA is used to analyze data from designed experiments, generalized linear models can also be used to analyze count data from designed experiments. However, JMP is limited to designs without random effects, e.g. no GLIMs that involve split-plot designs.

Consider an experiment to investigate 10 treatments (a control vs. a 3x3 factorial structure for two factors A and B) on controlling insect numbers. The experiment was run in a randomized block design (see earlier chapters). In each block, the 10 treatments were randomized to 10 different trees.
On each tree, a trap was mounted, and the number of insects caught in each trap was recorded. Here is the raw data.10

10 This is example 10.4.1 from SAS for Linear Models, 4th Edition. Data extracted from http://ftp.sas.com/samples/A56655 on 2006-07-19.

Block  Treatment  A  B  Count
1      1          1  1  6
1      2          1  2  2
1      5          2  2  3
1      8          3  2  3
1      7          3  1  1
1      0          0  0  16
1      3          1  3  4
1      6          2  3  1
1      9          3  3  1
1      4          2  1  5
2      1          1  1  9
2      2          1  2  6
2      5          2  2  4
2      8          3  2  2
2      7          3  1  2
2      0          0  0  25
2      3          1  3  3
2      6          2  3  5
2      9          3  3  0
2      4          2  1  3
3      1          1  1  2
3      2          1  2  14
3      5          2  2  6
3      8          3  2  3
3      7          3  1  2
3      0          0  0  5
3      3          1  3  5
3      6          2  3  17
3      9          3  3  2
3      4          2  1  3
4      1          1  1  22
4      2          1  2  4
4      5          2  2  3
4      8          3  2  4
4      7          3  1  3
4      0          0  0  9
4      3          1  3  5
4      6          2  3  1
4      9          3  3  9
4      4          2  1  2

The data are available in the JMP data file insectcount.jmp in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms.

The RCB model was fit using a generalized linear model with a log link:

Counti distributed Poisson(µi)
φi = log(µi)
φi = Block Treatment

where the simplified syntax Block and Treatment refers to block and treatment effects. Both Block and Treatment are categorical, and will be translated to sets of indicator variables in the usual way.

This model is fit in JMP using the Analyze->Fit Model platform. Note that the block and treatment variables must be nominally scaled. There is NO offset variable, as the insect traps were all the same size.

The Goodness-of-fit test shows strong evidence that the model doesn't fit, as the p-values are very small. Lack-of-fit can be caused by inadequacies of the actual model (perhaps a more complex model with block and treatment interactions is needed?), failure of the Poisson assumption, or use of the wrong link function. The residual plot shows that the data are more variable than expected from a Poisson distribution (about 95% of the residuals should be within ±2). The base model and link function seem reasonable, as there is no pattern to the residuals, merely over-dispersion relative to a Poisson distribution.

The adjustment for over-dispersion is made as seen earlier in the Analyze->Fit Model dialogue box, which gives the revised output. Note that the over-dispersion factor ĉ = 3.5. The test-statistics for the Effect Tests are adjusted by this factor (compare the chi-square of 76.37 for the treatment effects in the absence of adjusting for over-dispersion with the chi-square of 21.79 after adjusting for over-dispersion), and the p-values have been adjusted as well. The residuals have been adjusted by √ĉ and now look more acceptable. Note that the pattern of the residual plot doesn't change; all that the over-dispersion adjustment does is change the residual variance so that the standardization brings the residuals closer to 0.

If you compare the parameter estimates between the two models, you will find that the estimates are unchanged, but the reported se are increased by √ĉ to account for over-dispersion. As is the case with all categorical X variables, the interpretation of the estimates for the indicator variables depends upon the coding used by the package.
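A sketch of the same randomized-block fit and over-dispersion adjustment is shown below, using the counts from the table above. The block, treatment and count names are my own; both block and treatment are declared categorical so that they are expanded into indicator variables.

```python
# RCB Poisson model for the insect counts with a quasi-Poisson (Pearson-scaled) adjustment.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

trt = [1, 2, 5, 8, 7, 0, 3, 6, 9, 4]                  # treatment order within each block
counts = [6, 2, 3, 3, 1, 16, 4, 1, 1, 5,              # block 1
          9, 6, 4, 2, 2, 25, 3, 5, 0, 3,              # block 2
          2, 14, 6, 3, 2, 5, 5, 17, 2, 3,             # block 3
          22, 4, 3, 4, 3, 9, 5, 1, 9, 2]              # block 4
insects = pd.DataFrame({
    "block": [b for b in range(1, 5) for _ in range(10)],
    "treatment": trt * 4,
    "count": counts,
})

rcb = smf.glm("count ~ C(block) + C(treatment)", data=insects,
              family=sm.families.Poisson()).fit(scale="X2")   # ad hoc over-dispersion adjustment
print(rcb.pearson_chi2 / rcb.df_resid)    # over-dispersion factor c-hat (about 3.5 in the notes)
print(rcb.summary())
```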
JMP uses a coding where each indicator variable is compared to the mean response over all levels. Predictions of the mean response at levels of X are obtained in the usual fashion. The se will also be adjusted for overdispersion. However, it is now impossible to obtain prediction intervals for the ACTUAL number of events. By using the overdispersion factor, you are no longer assuming that the counts are distributed as a Poisson distribution – in fact, there is NO REAL DISTRIBUTION that has the mean-to-variance ratio that is implicitly assumed when using the overdispersion factor. Without an actual distribution, it is impossible to make predictions for individual events.

If comparisons among the treatment levels are of interest, it is better to use the built-in Contrast facilities of the package to compute the estimates and standard errors rather than trying to do this by hand. For example, suppose we are interested in comparing treatment 0 (the control) to the treatment with factor A at level 1 and factor B at level 1 (corresponding to treatment 1). The estimated difference in the log(mean) is −.34 (se .39), which corresponds to a ratio of e^(−.34) = .71 of treatment 1 to control, i.e. on average, the number of insects in the treatment 1 traps is 71% of the number of insects in the control traps. An application of the delta-method shows that the se of the ratio is computed as se(e^θ̂) = se(θ̂) e^θ̂ = .39(.71) = .28. However, there was no evidence of a difference in trap counts, as the standard error was sufficiently large.

A 95% confidence interval for the difference in log(mean) is found as −.34 ± 2(.39), which gives (−1.12 → .44). Because the p-value was larger than α = .05, this confidence interval includes zero. When this interval is anti-logged, the 95% confidence interval for the ratio of mean counts is (.32 → 1.55), i.e. the true ratio of treatment counts to control counts is between .32 and 1.55. Because the p-value was greater than α = .05, this interval contains the value of 1 (indicating that the ratio of counts could be 1:1). It is also correct to compute the 95% confidence interval for the ratio using the estimated ratio ± 2 × its se. This gives (.71 ± 2(.28)) or (.15 → 1.27). In large samples, these confidence intervals are equivalent. In smaller samples, there is no real objective way to choose between them.

23.9 Log-linear models for multi-dimensional contingency tables

In the chapter on logistic regression, k × 2 contingency tables were analyzed to see if the proportions of responses in the population that fell in the two categories (e.g. survived or died) were the same across the k levels of the factor (e.g. sex, or passenger class, or dose of a drug). The use of logistic regression is a special case of the general r × c contingency table, where observations are classified by r levels of a factor and c levels of a response. In a separate chapter, χ2 tests were used to test the hypothesis of equal population proportions in the c levels of the response across all levels of the factor. This is also known as the test of independence of the response from the levels of the factor.

This can be generalized to the analysis of multi-dimensional tables using Poisson-regression. In more advanced courses, you can learn how the two previous cases are simple cases of this more general modelling approach.
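As a preview of the log-linear approach, the sketch below (with made-up counts, not data from these notes) fits the model of independence for a 2 × 2 table as a Poisson regression of the cell counts on the two classification factors; the residual deviance is the usual G² test of independence.

```python
# Log-linear model of independence for a two-way contingency table.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

cells = pd.DataFrame({
    "sex":      ["m", "m", "f", "f"],
    "response": ["yes", "no", "yes", "no"],
    "count":    [30, 70, 45, 55],            # hypothetical cell counts
})
indep = smf.glm("count ~ C(sex) + C(response)", data=cells,
                family=sm.families.Poisson()).fit()   # no sex:response interaction => independence
print(indep.deviance, indep.df_resid)   # residual deviance is the G^2 statistic for independence
```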
Consult Agresti's book for a fuller account of this topic.

23.10 Variable selection methods

To be added later.

23.11 Summary

Poisson-regression is the standard tool for the analysis of "smallish" count data. If the counts are large (say on the order of hundreds), you could likely use ordinary or weighted regression methods without difficulty.

This chapter only concerns itself with data collected under a simple random sample or a completely randomized design. If the data are collected under other designs, please consult with a statistician for the proper analysis.

A common problem that I have encountered is data that have been pre-standardized. For example, data may be recorded on the number of tree stems in 100 m2 test plots. These data could likely be modeled using Poisson regression. But then the data are standardized to a "per hectare" basis. These standardized data are NO LONGER distributed as a Poisson distribution. It would be preferable to analyze the data using the sampling units that were used to collect the data, with an offset variable being used to adjust for differing sizes of survey units.

A common cause of overdispersion is non-independence in the data. For example, data may be collected using a cluster design rather than by a simple random sample. Overdispersion can be accounted for using quasi-likelihood methods. As a rule of thumb, overdispersion factors ĉ of 10 or less are acceptable. Much larger overdispersion factors indicate other serious problems in the model. An alternative to the use of the correction factor is to use a different distribution, such as the negative binomial distribution.

Related models for this chapter are the zero-inflated Poisson (ZIP) models. In these models there is an excess number of zeroes relative to what would be expected under a Poisson model. The ZIP model has two parts – the probability that an observation will be zero, and then the distribution of the non-zero counts. There is a substantial base in the literature on this model.

This is the end of the chapter.