Developing multilevel models
for analysing contextuality,
heterogeneity and change
using MLwiN 2.2
Volume 2
Kelvyn Jones
SV Subramanian
June 2011
Preface to Volume 2
The purpose of these volumes is to provide a thorough account of how to implement
multilevel models from a social science perspective in general and geographic perspective in
particular. We use the MLwiN software throughout.
Volume 1 introduces the software and then provides an extended account of the Normal-theory two-level and three-level models estimated by maximum likelihood and by Bayesian
MCMC analysis. Volume 2 extends the analysis in a number of ways. First we consider
discrete outcomes and in particular we provide an account of how to analyse the outcome
of employed or not in a multilevel logistic model. This is followed by the analysis of count
data, where we consider the nature of HIV variations in India using multilevel Poisson and NBD
models. Second, the models are extended to repeated measures panel data and
there is an extended consideration of the two-level growth model that focuses on the
flexibility of the model to estimate heterogeneity and dependency. This is further extended
to analyse age and cohort effects. A final chapter considers spatial models and the
simultaneous modelling of space-time data. This remains a work in progress, as further
chapters are planned on multivariate models with more than one outcome, on item-response models which can be used to develop measurement models for people and places,
on models for segregation, and on the analysis of randomized cluster trials.
Contents
12. Logistic modelling of proportions
Introduction
The data on teenage employment in Glasgow
Model 1: null random intercepts model
Model 2 with fixed part terms for qualifications and gender
Model 2b: changing estimation
Model 3: modelling the cross-level interaction between gender, qualifications and
adult unemployment
Estimating the VPC
Characterising the between-area effects with the Median Odds Ratio
Characterising the effect of higher-level predictors with the Interval Odds Ratio
Comparing models: using the DIC estimated in MCMC
Some answers
13. HIV in India: An example of a Poisson and an NBD model
Introduction
The data
Defining the exposure
Model 1: a single level model for Gender
Model 2: a single level model for Gender and Age Main effects
Model 3: a single level model for Gender and Age with interactions
Model 4: a single-level model for Gender-Age interactions and Education
Model 5: a two-level model for States
Comparing alternative estimates of Model 5: a two-level model for States
Model 6: between State differences for Urban-Rural
More efficient MCMC samplers
Some answers
14. Longitudinal analysis of repeated measures data
Introduction
A conceptual overview of the random-effects, subject-specific approach
Algebraic specification of random effects model for repeated measures
Some covariance structures for repeated measures data
Estimating a two and three level random intercepts and slopes model
A digression on orthogonal polynomials
Elaborating the random part of the model: accommodating temporal dependence
Discrete outcomes: population average versus subject specific
Subject-specific and population average inferences in practice.
Fixed versus Random effects
What we have learnt
Answers to Questions
15. Modelling longitudinal and cross-sectional effects
Introduction
Age, cohort and period in the Madeira Growth Study
Alternative designs for studying change
The Accelerated Longitudinal design of the Madeira Growth Study
Specifying and estimating cohort effects
Modelling age, sex and cohort
Modelling a two level model: Mundlak formulation
Changing gender ideology in the UK
Building the base model
Including longitudinal and cohort effects
Including Gender as a main effect and as interactions
Age, period and cohorts?
What we have learnt
16. The analysis of spatial and space-time models
Introduction:
What do we mean by adjacency: defining spatial neighbours
Three types of spatial models
Spatial lag dependence or autoregressive
Spatial residual dependence models
Spatial heterogeneity models
The spatial multiple membership model
Applying the spatial multiple membership model
Low birth weights in South Carolina
Respiratory cancer deaths in Ohio counties: space-time modelling
Self-rated health of the elderly in China
What we have learnt
12. Logistic modelling of proportions
Introduction
This chapter is the first of two concerned with the analysis of discrete
outcomes, and in particular with models where the response is a proportion. Substantively the
model is concerned with the proportion of teenagers that are in employment and what
individual characteristics influence this outcome. We also consider the degree of variation
in small areas of Glasgow, and the extent to which adult unemployment relates
differentially to teenage employment. The model is estimated as a logistic model with a
binomial level-1 random term. As such it uses many of the procedures that have been
covered in Volume 1 such as model specification, testing, the use of cross-level interactions,
and the calculation of the VPC. The initial model is estimated by maximum likelihood
procedures and later models by MCMC methods. The same basic model can also be used to
estimate binary models.
The data on teenage employment in Glasgow
Retrieve the data
File on main menu
Open worksheet
employ.ws
Postcode is neighbourhood in Glasgow
Cell is element of the table for each postcode
Gender is male or female
Qualif is unqualified or qualified
Employed is count of number of employed teenagers in cell
Total is number of employed and unemployed teenagers in cell
Adunemp is adult unemployment in neighbourhood
Proportion is employed/total
Code is categorical variable
1 = unqualified male
2 = unqualified females
3 = qualified males
4 = qualified females
-3-
Highlight the Names of the data; all variables
Press View button
Ensure data is sorted; cells within postcodes
Data Manipulation on main menu
Sort on Postcode and Cell, carry the rest, and put back into the original variables
-4-
Model 1: null random intercepts model
Model on main menu
Equations
Click on y and change to Proportion
Choose 2 levels
postcode as level 2
cell as level 1
Done
Click on N (for Normal theory model) and change to
Binomial distribution , then choose
Logit Link
Click on red (ie unspecified) nij inside the Binomial brackets and
choose ‘total’ to be the binomial denominator (= number of trials)
Click on B0 and choose the Constant, tick fixed effect; tick the j(postcode) to allow to
vary over postcode (it is not allowed to vary at cell level, as we are assuming that all
variation at this level is pure binomial variation)
Click on Nonlinear in the bottom toolbar; this controls specification and estimation:
Use Defaults [this gives an exact Binomial distribution for level 1; 1st order linearization and MQL estimation]
Done
Click on the + on the bottom toolbar to reveal the full specification.
At this point, the equations window should look like
The variable ‘proportion employed’ in cell i of postcode j is specified to come from a
Binomial distribution with an underlying probability, π_ij. The logit of the underlying
probability is related to a fixed effect, β0, and an allowed-to-vary effect u0j which, as usual, is
assumed to come from a Normal distribution. The level-1, between-cell variation is assumed to be
pure binomial variation in that it depends on the underlying probability and the total
number of teenagers in a cell; it is not a parameter that has to be estimated.
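In algebraic form (a standard way of writing what the equations window displays), this null model is

$$\text{Proportion}_{ij} \sim \text{Binomial}(n_{ij}, \pi_{ij}), \qquad \mathrm{logit}(\pi_{ij}) = \beta_0 + u_{0j}, \qquad u_{0j} \sim N(0, \sigma^2_{u0})$$

with n_ij the binomial denominator (Total).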
It is worth looking at the worksheet as MLwiN will have created two variables in the
background, Denom is the number of trials, that is Total in our case, while Bcons is a
constant associated with the level 1 cell which is used in the calculation of the binomial
weights; we can ignore this.
Before estimating, it is important to check the hierarchy
Model on main menu
Hierarchy viewer
Question 1: Why the variability in the number of cells?
Answers at the end of chapter
___________________________________________________________________________
-6-
Before proceeding to estimation we can check the location of the non-linear macros for discrete
data
Options on main menu
Directories
MLwiN creates a small file during estimation which has to be written temporarily to the
current directory; this therefore has to be a place where files can be written. Consequently,
you may have to change your current directory to one that is writable. Do this
now.
After pressing Start the model should converge to the following results; click on the
lower Estimates button to see the numerical values.
Question 2
Who is the constant? What is 1.176? What is 0.270?
Does the log-odds of teenage employment vary over the city?
-7-
We can store the estimates of this model as follows
Equations window
Click on Store model results
type in One in the pane
Ok
To see the results
Model on main menu
Compare stored models
This brings up the results in tabular form; these can be copied as a tab-delimited text file to
the clipboard and pasted into Microsoft Word. In Word, highlight the pasted text and select
Table, Insert, Table.
The log-odds are rather difficult to interpret, but we can change an estimate to a probability
using the Customised predictions window:
Model on main menu
Customised predictions
In setup window
Confidence 95
Button on for Probabilities
Tick Medians
Tick Means
at bottom of pane: Fill grid
at bottom of pane: Predict
Switch to Predictions: all results have been stored in the worksheet.
The setup window should look like
-8-
The predictions window should look like:
The cluster-specific estimated probability is given by the median of 0.764, with 95%
confidence intervals of 0.737 and 0.789; while the population average values are very
similar (0.755, CI: 0.73, 0.78). If we use Descriptive statistics on the main menu we find that
the simple mean of the raw probabilities is 0.75. The median rate of employment for
teenagers in Glasgow districts is 0.76.
Returning to the Setup window we can additionally tick for the coverage for level 2
postcodes and request the 95% coverage
-9-
Click Predict and then go to the Predictions subwindow:
The estimated average teenage employment probability is 0.753, while the 95% coverage
interval for Glasgow areas is between 0.539 and 0.908. As these values are derived from
simulation, you can expect slightly different values from these.
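These figures can be cross-checked outside MLwiN. Here is a minimal sketch in Python (an illustrative re-computation using the Model 1 estimates above, not part of the MLwiN workflow):

import numpy as np

rng = np.random.default_rng(42)
b0, var_u = 1.176, 0.270                  # Model 1: fixed intercept and level-2 variance

# Cluster-specific probability: invert the logit at u0j = 0 (the median postcode)
p_cs = 1 / (1 + np.exp(-b0))              # ~0.764

# Population-average probability and 95% coverage: average over simulated postcode effects
u = rng.normal(0.0, np.sqrt(var_u), 100_000)
p = 1 / (1 + np.exp(-(b0 + u)))
p_pa = p.mean()                           # ~0.75 (population average)
lo, hi = np.quantile(p, [0.025, 0.975])   # coverage interval, ~0.54 to ~0.90
print(p_cs, p_pa, lo, hi)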
Model 2 with fixed part terms for qualifications and gender
Returning to the equations window we can now distinguish between different types
of teenagers. Add a term using Code, with Unmale as the base or reference category, so that
the revised model after convergence is:
We can store the estimates of this model as Two using the Store button on the equations
window
Model on main menu
Compare stored models
This brings up the results in tabular form
We can now calculate the probability for all four types of teenager:
Model on main menu
Customised predictions
In setup window
Clear [gets rid of previous choices]
Highlight Code and request Change Range
Click on Category and tick on each and every category for
different type of teenager (unmale etc)
Confidence 95
Button on for Probabilities
Tick Medians
Tick Means
at bottom of pane: Fill grid
at bottom of pane: Predict
The completed Setup window is:
Switch to Predictions tab
And the Predict window gives
The values can be copied and pasted into Word to form a table
Code.pred   constant.   median.      median.low.   median.high.   mean.pred    mean.low.    mean.high.
unmale      1           0.6286546    0.57446176    0.68216538     0.61993444   0.56835616   0.67123282
Unfem       1           0.66288924   0.61281633    0.71329087     0.65272975   0.60478234   0.70139945
qualmale    1           0.82063246   0.78834242    0.85116911     0.80792409   0.77535808   0.83902699
qualfem     1           0.84216172   0.81049794    0.87156969     0.82985991   0.79759902   0.86013967
The higher employment is found for qualified teenagers; this is most easily seen by plotting
the results
Predictions sub-window
Plot Grid
Y is Mean.pred, that is population averages
Tick 95% confidence intervals
Button error bars
X variable: tick code.pred
Apply
After some re-labelling of the graph we get (the plot is in customized windows D1)
The wider confidence bands for the unqualified reflect that there are fewer such teenagers.
Staying with this random-intercepts model, we can see the 95% coverage across
Glasgow neighbourhoods for different types of teenagers:
Model on main menu
Customised predictions
In Setup window
Tick coverage for postcode, and 95% coverage interval
Predict
Predictions sub-window
Across Glasgow the average probability of employment for unqualified males is estimated to
be 0.628; in the 95% worst and best areas the probabilities are 0.422 and 0.823 respectively.
Sometimes it is preferred to interpret results from a logit model as relative odds,
that is relative to some base or reference group. This can also be achieved in the customized
predictions window. First we have to estimate differential logits by choosing a base category
for our comparisons, and then we can exponentiate these values to get the relative odds of
being employed. Here we choose unqualified males as the base category so that other
teenagers will be compared to that group.
Customised predictions
In Setup window
Button logit (instead of probabilities)
Tick differences from variable Code, reference value Unmale
Untick means
Untick coverage
Predict
In the prediction sub window
This gives the estimated differential cluster-specific logits. Note that the logit for Unmale
has been set to zero and the other values are differential logits. These are the values given in
the model equations window, as contrast coding has been used. We can now plot these
values:
Plot Grid
Y is median.pred (not mean.pred)
X is code.pred
Tick 95% confidence interval
Button error bars
This will at first give the differential logits; to get odds we need to exponentiate the median
and the 95% low and high values (from the Names window we see these are stored in c15-c17)
Data manipulation
Command interface
expo c15-c17 c15-c17
After some re-labelling of the graph
In a relatively simple model with only one categorical predictor generating four main
effects, we can achieve some of the above calculations by just using the Calculate command
and the Expo and Alogit functions. Here are some illustrative results of doing this ‘by hand’:
Data manipulation
Command interface

calc b1 = 0.529            [stores the logit for unqualified males in a Box, that is a single value, in comparison to a variate in a Column]
calc b2 = alogit b1        [derives the cluster-specific probability for unqualified males]
0.62925
calc b1 = 0.529 + 1.149    [stores the logit for qualified females (base + differential)]
1.6780
calc b2 = alogit b1        [derives the cluster-specific probability for qualified females]
0.84264

To calculate the odds of being employed for any category compared to the base we simply
exponentiate the differential logit (do not include the term associated with the constant):

calc b1 = 1.149            [differential logit for qualified females]
calc b2 = expo b1          [odds for qualified females]
3.1550
The full table is as follows; it agrees, with minor rounding error, with the simulated values:

Who?             Logit                    Probability   Differential Logit   Odds
Unqual Males     0.529                    0.63          0                    1*
Unqual Females   0.529 + 0.149 = 0.678    0.66          0.149                1.16
Qual Males       0.529 + 0.996 = 1.525    0.82          0.996                2.71
Qual Females     0.529 + 1.149 = 1.678    0.84          1.149                3.12

* the odds for the base category must always be 1
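The same conversions are easy to script. A small Python sketch (our own illustration using the Model 2 logits above; small discrepancies from the table reflect rounding and MLwiN's simulation):

import math

base = 0.529                                   # logit for unqualified males
diff = {"unmale": 0.0, "unfem": 0.149, "qualmale": 0.996, "qualfem": 1.149}

for who, d in diff.items():
    logit = base + d
    prob = 1 / (1 + math.exp(-logit))          # cluster-specific probability
    odds = math.exp(d)                         # odds relative to the base category
    print(f"{who:9s} logit={logit:.3f} prob={prob:.2f} odds={odds:.2f}")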
We can use the Intervals and tests window to test for the significance of difference
between gender for qualified and unqualified teenagers. NB for unqualified teenagers it is
given directly; for qualified it is not, and it has to be derived as a difference (note the -1)
The chi-square statistics are all small, indicating that there is little difference between the
genders. In contrast, the differences between the levels of qualification for both males and
females are highly significant.
Turning now to the random effects, an effective way of presenting these is to calculate the
odds of being employed against an all-Glasgow average of 1. First calculate the level-2
residuals and store them in c300, then exponentiate these values (using the command interface)
and plot them against their rank
Command interface
Expo c300 c300
At one extreme some places have only 0.4 of the city-wide odds; at the other extreme, the
odds are increased by a factor of 1.8, with of course the all-Glasgow average being 1.
Model 2b: changing estimation
We have so far used the default non-linear options of MQL, 1st order and an exact binomial
distribution; clicking on the Nonlinear button on the equations window we can change these
to PQL, 2nd order, and allow extra-binomial variation. After more iterations the model
converges to
Question 3:
Have the results changed a great deal? Is there significant over-dispersion for the extra-binomial variation?
__________________________________________________________________________
Note that we have tested the over-dispersion parameter (associated with the binomial
weight bcons) against 1, and that there is no significant overdispersion as shown by the very
low chi-square value. Use the non-linear button to set the distributional assumption back to
an exact Binomial.
Model 3: modelling the cross-level interaction between gender,
qualifications and adult unemployment
To estimate the effects of adult unemployment on teenage employment,
In equations window
Add term to the model
Choose Adunemp
centre this variable around a mean of 8% [the rounded, across-Glasgow average].
Done
This gives the main effect for adult unemployment. We want to see whether this interacts
with the individual characteristics of qualification and gender.
In equations window
Order 1
first order interactions
Code
choose unmale as base
Adunemp
the continuous variable (the software takes account of centering)
Done
After more iterations to convergence the results are:
Store the model as three
Mstore "three"
Comparing the stored models brings up the results in tabular form
Estimates with standard errors in brackets:

                            Model One        Model Two        Model Three
Response                    proportion       proportion       proportion
Fixed Part
Constant                    1.176 (0.075)    0.529 (0.118)    0.705 (0.127)
Unfem                                        0.149 (0.148)    0.048 (0.168)
qualmale                                     0.996 (0.149)    0.866 (0.160)
Qualfem                                      1.149 (0.151)    1.078 (0.165)
(adunemp-8)                                                   -0.111 (0.025)
unfem.(adunemp-8)                                             0.054 (0.030)
qualmale.(adunemp-8)                                          0.071 (0.033)
qualfem.(adunemp-8)                                           0.033 (0.028)
Random Part (Level: postcode)
constant/constant           0.270 (0.079)    0.237 (0.075)    0.153 (0.062)
The results are perhaps most easily appreciated as the probability of being employed
in a cross-level interaction plot (adunemp is a level-2 variable; code is a level-1 variable)
Model on main menu
Customised predictions (this automatically takes account of interactions)
In Setup window
Clear [gets rid of previous choices; this must be done as specification
changed]
Highlight Adunemp and request Change Range
Nested means; level of nesting 1 (repeated calculation of means to get 3
characteristic values of the un-centred variable)
Done
Highlight Code and request Change Range
Click on Category and tick on each and every category for different type of
teenager (unmale etc)
Done
Confidence 95
Button on for Probabilities
Tick Medians
Tick Means
at bottom of pane: Fill grid
at bottom of pane: Predict
Predictions sub-window
The predictions are for 12 rows (4 types of teenager for each of 3 characteristic values of
adult unemployment):
To get a plot
Plot Grid
Y is median.pred (cluster specific)
X is adunemp (the continuous predictor)
Grouped by code.pred (the 4 types of teenager)
Tick off the 95% CI’s (to see the lines clearly)
Thickening the lines and putting labels on the graph:
Estimating the VPC
The next thing that we would like to do for this model is to partition the variance to see
what percentage of the residual variation still lies between postcodes. This is not as
straightforward as in the Normal-theory case.
One simple method is to use a threshold approach and to treat the level-1, between-cell
variation as having the variance of a standard logistic distribution, which is π²/3 = 3.29.¹ Then with
this model, the proportion of the variance lying between postcodes is
calc b1 = 0.153/ (0.153 + 3.29)
0.044438
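Equivalently, in symbols, this threshold-based VPC is

$$\mathrm{VPC} = \frac{\sigma^2_{u0}}{\sigma^2_{u0} + \pi^2/3} = \frac{0.153}{0.153 + 3.29} \approx 0.044$$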
That is, 4% of the remaining unexplained variation lies at the district level. But this ignores the
fact that the level-1 variance is not constant, but is a function of the mean probability, which
depends on the predictors in the fixed part of the model. There is a macro called VPC.txt
that will simulate the values given desired settings for the predictor variables
Input values to c151 for all the fixed predictor values (Data manipulation and View)
EG 1 0 0 0 0 0 0 0 represents unqualified males in an area of average adult unemployment
Or
EG 1 0 0 1 0 0 0 0 represents qualified females in an area of average adult unemployment
Input values in c152 for predictor variables which have random coefficients at level 2
EG c152 1
because this a random-intercepts model
To run the Macro
File on main menu
Change the directory to something like C:\Program Files\Mlwin v2.1\Samples
Open macro vpc.txt then Execute
The result is obtained by print B8 in the Command window and then looking in Output
window.
prin b8
0.033117
which is for unqualified males, while the result for qualified females is
prin b8
0.020085
So some 2 to 3% of the residual variance lies between postcodes.
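The logic of the macro can be sketched in Python (an illustrative re-implementation of the simulation approach, not the macro itself; the logit 0.705 is the Model 3 fixed-part prediction for an unqualified male in an area of average adult unemployment):

import numpy as np

rng = np.random.default_rng(0)

def sim_vpc(xb, var_u, m=100_000):
    # Simulation-based VPC for a random-intercepts logistic model;
    # xb is the fixed-part prediction on the logit scale for chosen predictor values.
    u = rng.normal(0.0, np.sqrt(var_u), m)
    p = 1 / (1 + np.exp(-(xb + u)))
    level2 = p.var()                 # between-postcode variance on the probability scale
    level1 = np.mean(p * (1 - p))    # average binomial level-1 variance
    return level2 / (level2 + level1)

print(sim_vpc(0.705, 0.153))          # unqualified male: ~0.033
print(sim_vpc(0.705 + 1.078, 0.153))  # qualified female: ~0.020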
¹ Snijders T, Bosker R (1999) Multilevel Analysis: An Introduction to Basic and Advanced Multilevel Modelling. London: Sage.
Characterising the between-area effects with the Median Odds Ratio
There is a growing agreement in the multilevel modelling community that the
Median Odds Ratio (MOR) of Larsen is a more effective way of portraying the higher-level variance
in discrete models than the VPC.² The MOR transforms the between-area variance on the
logit scale to a much more interpretable odds scale that can be compared to the relative
odds of terms in the fixed part of the model. The MOR can be conceptualised as the increased
risk (in the median sense) that would result from moving from a lower- to a
higher-risk area if two areas were chosen at random from the distribution with the
estimated level-2 variance. The formula is as follows:
$$\mathrm{MOR} = \exp\!\left(\sqrt{2\sigma^2_u}\,\Phi^{-1}(0.75)\right) = \exp\!\left(\sqrt{2\sigma^2_u}\times 0.6745\right)$$

where σ²_u is the level-2 between-postcode variance on the logit scale (this would be
replaced with a variance function if random slopes are involved), and Φ⁻¹(0.75) is the 75th
percentile of the cumulative distribution function of the Normal distribution with mean 0
and variance 1. The Figure shows the relations between the three measures.
Thus the MOR for Models 1 to 3 is

calc b1 = expo((2 * 0.270)**0.5 * 0.6745)
1.6416
calc b1 = expo((2 * 0.237)**0.5 * 0.6745)
1.5910
calc b1 = expo((2 * 0.153)**0.5 * 0.6745)
1.4523

² Larsen K, Merlo J (2005) Appropriate assessment of neighbourhood effects on individual health: integrating random and fixed effects in multilevel logistic regression. Am J Epidemiol, 161, 81-88; Merlo J, Chaix B, Yang M, Lynch J, Rastam L (2005) A brief conceptual tutorial of multilevel analysis in social epidemiology: linking the statistical concept of clustering to the idea of contextual phenomenon. J Epidemiol Community Health, 59(6), 443-449.
According to this measure there is quite a bit of area heterogeneity which is larger than the
gender gap but smaller than the qualifications gap in the relative odds. The credible
intervals for a MOR can be obtained by ‘plugging in’ the credible intervals from an MCMC.
Thus for the model estimated below with 10k MCMC monitoring simulations, the MOR and its
95% credible intervals are:

calc b1 = expo((2 * 0.059)**0.5 * 0.6745)
1.2607    [95% lower]
calc b1 = expo((2 * 0.168)**0.5 * 0.6745)
1.4784    [MOR]
calc b1 = expo((2 * 0.317)**0.5 * 0.6745)
1.7110    [95% higher]
Characterising the effect of higher-level predictors with the Interval Odds Ratio
Larsen has also introduced a statistic he calls the Interval Odds Ratio (IOR).³ This aims to
assess the effect of higher-level cluster variables on an odds scale, taking into account the
residual heterogeneity between areas. It is calculated as an interval, for two persons
with differing values x₁ and x₂ of the higher-level variable, covering the middle 80
percent⁴ of odds ratios:
$$\mathrm{IOR}_{\mathrm{lower}} = \exp\!\left(\beta(x_2 - x_1) + \sqrt{2\sigma^2_u}\,\Phi^{-1}(0.10)\right), \qquad \mathrm{IOR}_{\mathrm{upper}} = \exp\!\left(\beta(x_2 - x_1) + \sqrt{2\sigma^2_u}\,\Phi^{-1}(0.90)\right)$$

where Φ⁻¹(0.10) and Φ⁻¹(0.90) are the 10th and 90th percentiles of the Normal
distribution, which give the values -1.2816 and +1.2816. If the interval contains the value 1,
the effect of the higher-level variable is not strong given the residual between-area
variation. But if it does not contain 1, the effect of the higher-level variable is large in
comparison to the unexplained between-neighbourhood variation; moving between
neighbourhoods with different levels of adult unemployment is not going to be swamped by
other (unexplained) neighbourhood effects.

³ Larsen K, Petersen JH, Budtz-Jørgensen E, Endahl L (2000) Interpreting parameters in the logistic regression model with random effects. Biometrics, 56(3), 909-914.
⁴ The 80% is arbitrary but commonly used.
In Model 3 the residual between-neighbourhood variance is 0.153 and the main
effect for the level-2 variable of adult unemployment is -0.111, that is, the effect for an
unqualified male. We can therefore calculate the IOR 80% values for an unqualified male
teenager who lives in a lower-quartile neighbourhood in comparison with an upper-quartile
neighbourhood, a value for adult unemployment of 5.085% in comparison to 9.65%.
We first calculate the simple odds ratio without taking into account potential
neighbourhood differences
calc b1 = expo(-0.111 * (5.085 -9.65))
1.6598
so that moving between neighbourhoods does appear to change the odds of
employment. However, when we additionally take into account the other potential
neighbourhood differences, the IOR80% is calculated to be
calc b1 = expo( -0.111 *(5.085 -9.65) + (2 * 0.153)**0.5 * (-1.2816))
0.81691
calc b1 = expo( -0.111 *(5.085 -9.65) + (2 * 0.153)**0.5 * (1.2816))
3.3725
and this straddles 1. This suggests that the difference between these two types of areas is
not large relative to the unexplained effect: there is quite a lot of chance that the teenager
will not have an increased propensity of employment given the changed neighbourhood
characteristics and the variation between neighbourhoods. If we look at a more marked
neighbourhood change, from the 5% best to the 5% worst, a change of adult
unemployment from 2.958% to 15.764%, the standard odds value is
Calc b1 = expo( -0.111 *(2.958 -15.764) )
4.1432
and the IORs are

calc b1 = expo( -0.111 *(2.958 -15.764) + (2 * 0.153)**0.5 * (-1.2816))
2.0391
calc b1 = expo( -0.111 *(2.958 -15.764) + (2 * 0.153)**0.5 * (1.2816))
8.4183
The interval does not include 1, so this large-scale change in the neighbourhood
characteristic does increase the propensity to be employed even when account is taken of
how much neighbourhoods differ. It must be stressed that the IOR is not a confidence
interval, and care would be needed in plugging in the MCMC estimates as there is no joint
estimation of the credible intervals of the fixed and random terms.
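Both measures are easy to script. A minimal Python sketch (the function names are our own; the values are the Model 3 estimates used above):

import math
from statistics import NormalDist

def mor(var_u):
    # Median Odds Ratio for a random-intercepts model with level-2 variance var_u
    return math.exp(math.sqrt(2 * var_u) * NormalDist().inv_cdf(0.75))

def ior80(beta, x1, x2, var_u):
    # 80% Interval Odds Ratio for a change in a level-2 predictor from x1 to x2
    base = beta * (x2 - x1)
    half = math.sqrt(2 * var_u) * NormalDist().inv_cdf(0.90)
    return math.exp(base - half), math.exp(base + half)

print(mor(0.153))                           # ~1.45, as above
print(ior80(-0.111, 9.65, 5.085, 0.153))    # ~(0.82, 3.37): straddles 1
print(ior80(-0.111, 15.764, 2.958, 0.153))  # ~(2.04, 8.42): does not straddle 1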
Comparing models: using the DIC estimated in MCMC
Unfortunately, because of the way that logit models are estimated in MLwiN through quasi-likelihood, it is not possible to use the usual deviance to compare models. One could use the
Intervals and Tests procedures to test individual and sets of estimates for significance. But
using MCMC methodology one can compare the overall fit of different models using the DIC
diagnostic.
Using the IGLS/ RIGLS estimates as starting values
Estimation Control
Switch to MCMC and use the default values of a burn-in of 500, followed by a
monitoring length of 5000
Start
Store results as MCMC5k
To examine the estimates
Model on main menu
Trajectories
Select the level 2 variance (Postcode: Constant/Constant)
Change Structured graph layout to ‘1 graph per row’
Done
This gives the trajectory of the estimate for the last 500 simulated draws
Click in the middle of this graph to get the summary of these results:
You can see that the mean of the estimate for the level-2 variance is 0.166 and that the 95%
credible interval, running from 0.058 to 0.308, does not include zero; the parameter
distribution is positively skewed, and the asymmetric CIs reflect this. Note however that
both the Raftery-Lewis and Brooks-Draper statistics suggest that we have not run the
chain for long enough, as the chain is highly auto-correlated; we have requested a run of
5000 simulations but they are behaving as an effective sample size of only 65.⁵
Ignoring this for the moment, we want to get the DIC diagnostic,
Model on main menu
MCMC
DIC diagnostic
produces the following results in the output window
Bayesian Deviance Information Criterion (DIC)

Dbar     D(thetabar)   pD      DIC
885.76   844.88        40.87   926.63
To increase the number of simulated draws
Estimation Control
MCMC
Change monitoring from 5000 to 10000
Done
More iterations on top bar
Store results as MCMC10k
⁵ There are a number of recently developed procedures (discussed in Volume 1, Chapter 10) that we can use to improve the efficiency of the sampling through the MCMC options. We found that for this model and for this term there was no substantial improvement in efficiency even when orthogonal parameterization and hierarchical centring were used in combination.
The trajectories will be updated as the 5000 extra draws are performed (in large models it
makes good sense to close the trajectories and equations windows down, as they slow down
estimation without being really informative).
Click Update on the MCMC diagnostics
to see that there are now effectively 246 independent draws. The MCMC results of the
two models can now be compared,
and it would appear that there is very little change with increased length of the monitoring
run. Thus, the DIC diagnostic is

Dbar     D(thetabar)   pD      DIC
886.30   844.91        41.39   927.70
Doubling the number of draws has changed the DIC diagnostic by only a very small amount.
There are two key elements to the interpretation of the DIC:

pD
This gives the complexity of the model as the 'effective degrees of freedom'
consumed in the fit; it takes into account both the fixed and random parts. Here we
know there are 8 fixed terms, and the rest of the effective degrees of freedom comes
from treating the 122 postcodes as a distribution.

DIC
The Deviance Information Criterion is a generalisation of the Akaike
Information Criterion (AIC). The AIC is the deviance + 2p, where p is the number of
parameters fitted in the model, and the model with the smallest AIC is chosen as the
most appropriate. The DIC diagnostic is simple to calculate from an MCMC
run as it simply involves calculating the value of the deviance at each iteration, and
the deviance at the expected value of the unknown parameters. We can then
calculate the 'effective' number of parameters by subtracting the deviance at the
expected values from the average deviance over the complete set of iterations. The DIC
can then be used to compare models as it consists of the sum of two terms that measure
the 'fit' and the 'complexity' of a particular model. Models with a lower DIC are therefore to be
preferred as a trade-off between complexity and fit. Crucially, this measure can be
used in the comparison of non-nested models and non-linear models.
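In symbols, and checking against the 10k run above:

$$p_D = \bar{D} - D(\bar{\theta}) = 886.30 - 844.91 = 41.39, \qquad \mathrm{DIC} = \bar{D} + p_D = 886.30 + 41.39 \approx 927.70$$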
Here are the results for a set of models, all based on 10k simulated draws. To change a
model specification, you have to use IGLS/RIGLS estimation and then MCMC; with single-level
models you cannot use MQL and 2nd order IGLS. The results are ordered in terms of
increasing DIC, the simplest and yet best-fitting model at the top. The Mwipe command
clears the stored estimates of the models.
Model   Terms                           pD      DIC
4       2 level, Cons+Code+Ad-Unemp     38.63   927.38
5       2 level, Cons+Code*Ad-Unemp     41.39   927.39
3       2 level, Cons+Code              48.71   937.01
2       2 level, Cons                   49.16   1025.44
1       1 level, Cons                   1       1086.44
In terms of DIC, the chosen model is a two level one, with an additive effect for 3 categories
of code and an additive effect for adult-unemployment, although there is no substantive
difference to the model with the cross-level interactions.
The plot for the final most parsimonious model is given below for logits and probabilities.
Some answers
Question 1: Why the variability in the number of cells?
In some postcode areas there is not the full complement of types of teenager; this is a form
of imbalance. Usually, estimation is not troubled by it.
Question 2
Who is the constant? All types of teenagers; there are no other terms in the fixed part of the model.
What is 1.176? The log-odds of being employed on average across all teenagers across all areas.
What is 0.270? The between-area variation on the logit scale.
Does the log-odds of teenage employment vary over the city? Yes, there appears to be
evidence of this.
Question 3:
Have the results changed a great deal?
No
Is there significant over-dispersion for the extra-binomial variation?
No, 1.025 is less than a standard error away from 1. We need to compare against 1 not 0.
13. HIV in India: An example of a Poisson and an NBD model
Introduction
This chapter aims to demonstrate the use of MLwiN in the fitting of two-level models to
count data. Substantively, the study aims to investigate the State geography of HIV in terms
of prevalence and how this is characterized by age-group, gender, educational level and
urbanity. We have tried to provide an outline of a 'realistic' research project and include
single-level models; random-intercepts and random-slopes two-level models; Poisson,
extra-Poisson, and NBD models with an offset; estimation by quasi-likelihood (IGLS, PQL, 2nd
order) and MCMC samplers; interpretation and graphing of results both as loge rates and as
relative risks or incidence rates; significance testing; and even models that do not converge!
The data
These are based on the nationally representative, cross-sectional data for some 100k
individuals from the 2005-2006 India National Family Health survey (NFHS-3), the first
national survey to include HIV testing. The survey was designed to provide a national
estimate of HIV in the population of women aged 15-49 and men aged 15-54, as well as
separate HIV estimates for each of what were thought to be the five highest HIV-prevalence
states – Andhra Pradesh, Karnataka, Maharashtra, Manipur, and Tamil Nadu – and for one
low HIV-prevalence state – Uttar Pradesh. In the remaining 22 states, HIV testing was
conducted in only a sub-sample of six households per enumeration area. The dependent
variable indicates HIV sero-status. Details of this procedure and of the sampling design are
given in Khan (2008) and the manual that accompanies the NFHS-3.6
The initial worksheet



File on main menu
Open worksheet
Filename
Hivcounts
⁶ Khan KT (2008) Social Determinants of HIV in India, Master of Science Thesis, Harvard School of Public Health; National Family Health Survey (NFHS-3), 2005-2006: India, Volumes I and II. 2007, IIPS: Mumbai.
The 1720 observations represent cells of a complex table: the cross-tabulation of 28 States by 4 Age-Groups by 4 Educational levels by 2 Sexes by 2 Urbanity
groups. Cells are therefore groups of people who share common characteristics. The
potentially full table (28*4*4*2*2) of 1792 cells has not been observed because not all
combinations of these variables were found in the sample.
In the Names window


Highlight all 7 variables
View
Gives the following data extract
The first row shows that in the State of Andhra Pradesh, 32 people who shared the
characteristics of being under 24, having been educated to a High level, being female and
living in a rural area were interviewed, and that none of them tested sero-positive for HIV.
The fourth row, again in the State of Andhra Pradesh, represents 1009 people who shared
the characteristics of being under 24, having been educated to Secondary level, being
female and living in an urban area. Three of these people tested sero-positive for HIV.
Such data, with low counts based on large denominators, are highly suited to Poisson and
NBD modelling.
To get an initial idea of how rare HIV is in the population we can sum the number of
Cases and the number of Cases+NonCases and calculate the overall ratio



Data Manipulation on Main Menu
Command interface
and enter the following commands into the lower box, one at a time and press
return
Sum 'Cases' b1
Sum 'Cases+NonCases' b2
calc b3 = b1/b2
B1, B2 and B3 represent boxes where the answers are stored; boxes hold single values (a
scalar) as opposed to c1, c2 etc., which are columns (or variables).
In the output window you should see the following results
->Sum 'Cases' b1
467.00
Overall only 467 sero-positives were found.
->Sum 'Cases+NonCases' b2
1.0235e+005
And this is from a survey of over 102 thousand people!
->calc b3 = b1/b2
0.0045629
Giving thankfully a rate of only 0.00456.
Question 1
Use the command interface to calculate the rate per 10,000
___________________________________________________________________________
Defining the exposure
The observed counts have to be compared to the number of people in the different groups
who have been observed or 'exposed'. That is, while we may get a relatively high count of
cases, this may simply reflect that we have observed a large number of people for this
type of cell. Thus the relatively high count of 3 cases in row 4 may not represent a high
prevalence but simply that there are a lot of people who share these characteristics. We
can overcome this problem by defining an expected count and comparing the observed count
with this value. Using the command window we can calculate the expected count as the
national rate (b3) times the number of people interviewed (Cases+NonCases), and also the
Standardised Morbidity Ratio (SMR) as the observed count (Cases) divided by the Expected count
if the national rate applied:
calc c8 = b3 * 'Cases+NonCases'
name c8 'Expected'
calc c9 = 'Cases' / 'Expected'
name c9 'SMR'
The revised worksheet is
and the revised data extract including the new variables
We can see that the observed count of 3 cases in row 4 is less than the expected number of
4.6 cases given who we had interviewed, and hence we have a SMR below 1; the morbidity
in this group is less than the national average. While we can clearly calculate a SMR we
should not place a great deal of weight on it, as it is very sensitive to the ‘small number
problem’. Thus, if we look at row 6, we see that the SMR for this group of people is 1.543,
that is nearly 50% in excess of the national rate, but this rate is very unreliable and small
changes in the number of cases would lead to very different SMR’s. If the one observed case
was zero, then the SMR would plummet to zero, but if a single extra case was observed the
SMR would be three times the national rate (2/0.648 equals 3.08). Clearly in this form the
SMR is highly troubled by the stochastic nature of the data and we require a model to
provide a more robust inferential framework.
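The same construction is easy to reproduce. A sketch in Python (illustrative only, with column names simplified: the first two rows echo rows 1 and 4 of the data extract, and the third is a hypothetical cell with 1 case and 142 respondents, chosen to match the expected count of about 0.648 in the 'row 6' example):

import pandas as pd

cells = pd.DataFrame({
    "Cases":         [0,  3,    1],      # observed sero-positives per cell
    "CasesNonCases": [32, 1009, 142],    # total respondents per cell
})
national_rate = 467 / 102_350            # b3: overall sero-positive rate
cells["Expected"] = national_rate * cells["CasesNonCases"]
cells["SMR"] = cells["Cases"] / cells["Expected"]
print(cells)                             # third row: Expected ~0.65, SMR ~1.54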
As an aside, we have used a simple way of calculating the expected values on the
basis of a national rate. We could have produced a more elaborate procedure. Thus we
could have produced an expected value based on the national rate for different age-sex
groups
Age-sex group   Cases   Respondents   Rate
< 24 Female     38      19576         0.0019
< 24 Male       24      17099         0.0014
25-34 Female    93      16416         0.0057
25-34 Male      119     13974         0.0085
34-44 Female    46      12390         0.0037
34-44 Male      102     11333         0.0090
45-54 Female    11      4085          0.0027
45-54 Male      34      7475          0.0045
But doing so would have effectively removed the effect of age and sex and this is something
that we are interested in.
Model 1: a single level model for Gender
The first model we will fit will be a Poisson model for the observed count, with an offset of
the expected value based on the national rate. The model will estimate the comparative
rate for men and women separately. We begin by creating a Constant, a vector of 1’s, and
storing in c10


Data manipulation
Generate vector
And naming the c10 column to be ‘Cons’
We can then specify the model




Model on main menu
Equations
y, the response is Cases
single-level model, i, with Cons declared to be the level-1 identifier [we can do this
at level 1; any arbitrary column can be used with this software]



Click on N for Normal and change response type to be Poisson
Change the red x0 to be Cons, Done
Double click on Estimates
To give the following non-converged model:
Where  is the underlying mean count, and the variance of the observed Cases is equal to
this mean.
Clicking on the  in the second line of the window allows us to specify the offset, but as the
window warns we first have to take the loge of the expected value



Data Manipulation on Main Menu
Command interface
enter the following command into the lower box, one at a time and press return
Name c12 'LogeExp'
calc 'LogeExp' = loge('Expected')
return to the clicking on the  in the second line of the equations window, we can now
specify the loge(offset), followed by done to get the revised equation
It is noticeable that there is no coefficient associated with the offset, as this value is
constrained to be 1. If you look at the Names window, you will see that MLwiN has created
two new variables in the background: 'Offs' holds the values of the offset variable, and
'bcons.1' is a placeholder for the Poisson weight which is used during estimation; do not
delete these variables.
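Algebraically, the offset model at this point is

$$\text{Cases}_i \sim \text{Poisson}(\mu_i), \qquad \log_e(\mu_i) = \log_e(\text{Expected}_i) + \beta_0$$

with the coefficient on the loge offset fixed at 1, so that β0 models the loge relative risk.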
Return to the equations window



Add term
Variable, choose to be Gender and reference category to be Female
Start iterations to convergence
The estimated model is
The positive value (+0.445) informs us that the loge mean count for Males is higher than
that for the base category of Females; there is a higher prevalence of HIV across all of India
for men, contrasted with the base of women.
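Exponentiating the coefficient gives the incidence rate ratio directly:

$$e^{0.445} \approx 1.56$$

that is, the Male rate is about 56% higher; the customised predictions below reproduce this by simulation.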
To help interpret the model, we will use customized predictions to get an Incidence
Rate Ratio for men compared to women.






Model on Main menu
Customised predictions
In the setup window, click on Gender, and Change range
Change Means to Categories by clicking on Male and Female, Done
Choose Rate and click on medians (in single-level models the medians and means of
the predictive distribution will give the same results)
Click on Differences and choose the From Variable to be Gender with the reference
value Female
The completed window should look like



Fill grid
Predict
Move to the Predictions sub-window
On the log scale, the Males are 0.445 higher than the Females. If we exponentiate this value
and the associated upper and lower 95% confidence intervals in the command interface

expo c17-c19 c17-c19

(you should check the Names window to see where the predictions are stored),
we get the transformed predicted values (you may have to close and re-open the window to see
the updated values).
So that when the rate for Females is set to 1, the rate for Males is some 56% higher. We can
now plot this value and the 95% confidence intervals
 Plot grid
 The y axis variable is median.pred
 The x axis variable is gender.pred
 Tick 95% confidence intervals
 Check error bars
 Apply
This sends the graph commands to Graph display 1 and plots the graph; with re-labeling and
titling, choosing Red as the colour on the Plot style tab, and ticking off group code on the
Other tab we get:
Clearly there is a significant difference between Men and Women in the incidence rate ratio
of HIV as the 95% confidence interval does not include 1.
Question 2
What would be the results if Males had been chosen for the reference category and not
Females?
__________________________________________________________________________
So far we have found that the Male rate is 1.56 times higher than the Female rate when the
Female rate is set to 1. We can also estimate both the Male and Female comparative rates
(the Standardised Morbidity Rate or the relative risk) in relation to the overall population.
That is the overall population seropositive rate of both men and women is set to 1, and we
are going to compare Men and Women to that figure. Using the results of Model 1 we
change the settings of the customized predictions window





Model on Main menu
Customised predictions
In the setup window, click on Gender, and Change range by clicking on Male and
Female, Done
Choose Rate and click on medians (in single-level models the medians and means of
the predictive distribution will give the same results)
Click off the Differences (this is the key change that allows the comparison with the
overall rate)
The completed window should look like

Fill grid


Predict
Move to the Predictions sub-window
So if the overall rate is 1 (this is given by constraining the log offset to 1), males are 23%
higher, while Females have a relative risk that is only 78% of that for the general population
(your figures may differ slightly from these due to the simulation procedure used in MLwiN to
derive them). We can relate these results to the earlier ones and calculate the
relative risk for Men compared to Women by using the command interface
calc 1.228/0.787
1.5604
The Male rate is indeed 1.56 times higher than the Female rate, as we found earlier.
We can plot the relative risk and the 95% confidence intervals for each gender in
relation to the overall rate






Plot grid
The y axis variable is median.pred
The x axis variable is gender.pred
Tick 95% confidence intervals
Check error bars
Apply
Which after a little editing gives
As the confidence bands do not overlap, Men are at a significantly higher risk of being seropositive than women. The similar lengths of the confidence bands reflect the similar number
of men and women in the survey.
Model 2: a single level model for Gender and Age Main effects
Returning to the model with Females as the base, we can add the main effects for different
age groups (Age Groups) choosing under 24 years of Age as the reference group. After
convergence we find
The positive values for the differential age categories and Male mean that the lowest loge
rate is for females aged under 24 years. We can use the customised predictions window to
calculate the relative risks for the different age-sex groups in comparison to the overall
population (which is given by the loge offset being constrained to 1):

Clear the old specification and then specify the window as follows, ensuring that
Age group is changed so that the predictions are made for all 4 categorical age
groups as well as for both categories of sex, make the predictions for Rates for the
medians and the 95% upper and lower confidence bands; Fill the Grid and make
Predictions
We get the predicted values which are the Standardized Morbidity Rates or relative risks for
HIV for each age-sex group in relation to the overall population.
When compared to the overall national rate set to 1, 25-34 year old men have nearly double
the incidence, while Females under 24 only have a third of the national rate. To derive a plot
that contrasts men and women at different ages:




Plot grid
The y axis variable is median.pred
The x axis variable is gender.pred
Grouped by Gender.pred



Tick 95% confidence intervals
Check error bars
Apply
Men generally have a higher rate than women, but the difference is only significant for the
two middle-age groups. The under 24 age-group for both men and women have the lowest
rates.
Question 3
Make a plot that contrasts the SMR at different ages for men and women separately.
___________________________________________________________________________
Model 3: a single level model for Gender and Age with interactions
Model 2 included Gender and Age as main effects, so the model is additive on the loge
scale. The differences between men and women at different ages in the above diagram are
simply a result of differential 'stretching' when the loge rate is exponentiated. Now we will fit
a model with interactions between Age and Sex; this will allow the gender differences to be
different for different age-groups even on the loge scale, or equivalently, the differential age
effects to be different by gender. Return to the equations window


Click on Add term
Order 1 for 1st order interaction
Variable Gender choosing Female as the reference category
Variable AgeGroup choosing <24 as the reference category
Done
After more iterations the following estimates are found
Question 4
What type of person is the constant in this model?
___________________________________________________________________________
Using the customized predictions we can make a set of predictions of the relative rates or
SMR’s, plotted for each age-sex category as shown below, stressing the age-group
differences.
Or alternatively stressing the gender differences.
There have been quite a few changes as compared to the simpler main-effects model with
the lowest rates of all being found for young men; the biggest gender gap is for the 34-44
year olds, the males of this group having the highest rates of all.
Model 4: a single-level model for Gender-Age interactions and Education
Return to the equations window



Add term
Variable choose to Educ and reference category to be High
More iterations to convergence
The estimated model is
A quick inspection of the results shows that the other three education categories have a
significantly higher loge rate than those who have received higher education. But the three
may not be significantly different from each other. We can test this using the intervals and
test window. First we will test that each of the contrasted categories are significantly
different to the base category of Higher.





Model on Main menu
Intervals and tests
Tick Fixed and set number of functions to 3
Place a 1 in the ‘request’ matrix for each of the hypotheses that are to be tested
Click calculate
It can be seen that each separate chi-square is significant with one degree of freedom, and
the joint test of the three is also significant; here are the p values found using the command
interface
cpro 13.266 1
0.00027026    [testing whether those with Secondary education are different to Higher; the very small p value means we can reject the null hypothesis of no difference]
cpro 22.843 1
1.7579e-006   [testing whether those with Primary education are different to Higher]
cpro 16.604 1
4.6054e-005   [testing whether those with No education are different to Higher]
cpro 23.909 3
2.6097e-005   [jointly testing whether all three are different to Higher]
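These p values can be checked with any chi-square tail function; for instance, a quick Python check (using SciPy, our choice, not part of MLwiN):

from scipy.stats import chi2

print(chi2.sf(13.266, df=1))   # ~2.7e-04, matching cpro 13.266 1
print(chi2.sf(23.909, df=3))   # ~2.6e-05, matching cpro 23.909 3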
There is very strong evidence that Indians who have received a higher education have
significantly lower risk. We can now test whether the differences between the three lower
educational categories are different.




Intervals and tests
Tick Fixed and set number of functions to 0 to clear, and then to 3
Place a 1 in the ‘request’ matrix for each of the hypotheses that are to be tested and
a -1 to signify that we are testing a difference
Click calculate
The joint null hypothesis that all three differences are equal to zero cannot be rejected:

cpro 5.247 3
0.15458

consequently, in the interests of parsimony, we will combine the four Education groups into
two
 In the equations window
 Click on the NoEd term
 Delete term, confirming that you want to delete all 3 categories from the model







Data manipulation on main menu
Recode
By value
Choose Educ as source column
A free column as destination (here c27)
Give the new values such that High gets a 0 and all others are set to 1
Execute





In the Names window,
Highlight c27 and Edit name to Educ2
Keeping the variable highlighted
toggle Categorical to true
click on Categories and give meaningful codes to the categories
High is 0
Low is 1
OK
In the Equations window
 Add term
 Variable choose to Educ2 and reference category to be High
 More iterations to convergence
This looks like a highly significant difference, and we can see whether this differential is found for
each age-sex category
In the Equations window
 Click on Add term
 Order 1 for 1st order interaction
Variable Educ2 choosing High as the reference category
Variable Gender choosing Female as the reference category
Done
More iterations to convergence.



Click on Add term
Order 1 for 1st order interaction
Variable Educ2 choosing High as the reference category
Variable AgeGroup choosing <24 as the reference category
Done
More iterations to convergence
Stop as the model is not converging
On inspection, the Low-45-54 interaction term is not converging due to its large standard
error. Click on this term and remove it from the fixed part by unticking (do not delete as this
would remove all interactions between Low education and the three contrasted age
groups).
More iterations to convergence, but this also results in a problem, so Start instead of More
(this will start model estimation from scratch)
On inspection, we see that none of the Educ2 and AgeGroup interactions are significant,


Click on Low.34-44
Delete, confirming that all three interaction terms need to be deleted
After more iterations the following estimates are found
We have gone back to a model only involving the interaction between Education and
Gender, but we also know that Age effects are not differentiated by Education. (We also
tried the second order interactions between Age*Educ2*Gender but none of these proved
to be reliably estimated.)
We can see these results as risks by using the customised predictions window to
make predictions for all 16 groups (2 Education * 2 Gender * 4 AgeGroups). You will have to
Clear the old specification and Change Range so that Means are replaced by Categories and
each and every category is ticked for all three variables.
To get a plot that displays the effects of the Education and Gender interaction







Plot grid
The y axis variable is median.pred
The x axis variable is Educ2.pred
Grouped by agegrp.pred
Trellis in the X direction: gender.pred
Tick 95% confidence intervals
Check error bars
Clearly, highly educated females have the lowest risk and that is across the age-groups. As
we have now completed the analysis of individual (cell-level) characteristics, we shall store
the model.


In Equations window
Store model, giving the label Four
Model 5: a two-level model for States
Return to the Equations window
 Click on Cases
N levels change to 2 – ij
Choose States to be level 2
Choose Cons to be level 1
Done


Click on Cons
Tick on j(States)
Done (you can now check the hierarchy, for 28 States; if not, the cause is probably
unsorted data)
Click on Non-linear estimation at the bottom of the equations pane
Click on 2nd order linearization
Click on PQL estimation
Using the IGLS quasi-likelihood approach, the preferred estimation is PQL, 2nd order (MQL
has a tendency to overestimate the higher-level variance in a Poisson model). You will see
the State level 2 residuals and the higher-level variance being included in the model. More
iterations to convergence.
We can store this model as FivePQL. There is clearly a sizeable between-State variance. We
can test this for significance.





Model on Main menu
Intervals and tests
Tick Random and set number of functions to 1
Place a 1 in the ‘request’ matrix for the hypotheses to be tested, that is picking out
the level-2 variance
Click calculate
The chi-square test with one degree of freedom confirms that there is significant between-State
variation, with a small p value.
CPRO 6.340 1
0.011804
Just as we did for the binomial model of the last chapter, we can also transform the
level-2 variance on the log scale to a ratio scale to more readily characterise the higher-level
variance. Larsen calls the equivalent of the MOR for the Poisson model the Median Mean
Ratio (MMR).⁷ The measure is always greater than or equal to 1. If the MMR is 1, there is
no variation between States. If there is considerable between-cluster variation, the MMR
will be large. The measure is directly comparable with fixed-effects relative risk ratios. It is
calculated in the same way as the MOR:
$$\mathrm{MMR} = \exp\!\left(\sqrt{2\sigma^2_u}\times 0.6745\right)$$

where σ²_u is the level-2 between-State variance on the log scale (this would be replaced
with a variance function if random slopes are involved).
calc b1 = expo((2 * 0.861)**0.5
2.4233
* 0.6745)
Thus there are considerable differences remaining between States even after the Age, Gender
and Education of individuals are taken into account; indeed this median difference between States
is comparable to the largest difference between the fixed effects.
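The same arithmetic can be checked in Python (a sketch; 0.861 is the estimated level-2 variance used above):

# A sketch: the Median Mean Ratio from the level-2 (between-State)
# variance on the log scale, MMR = exp(sqrt(2 * sigma2_u) * 0.6745).
import math
sigma2_u = 0.861                                   # estimated level-2 variance
mmr = math.exp(math.sqrt(2 * sigma2_u) * 0.6745)   # 0.6745 = 75th centile of N(0,1)
print(round(mmr, 4))                               # 2.4233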
Having calculated an overall measure of the difference between States, we will now see
the extent to which the relative risk of positive status varies by State.
 Model on main menu
 Residuals
 Change to 1.96 times the SD (comparative) error to give 95% confidence intervals
 Level: change to 2: States
 Set columns, noting where different values are being stored
 Calc

7 http://www.bgfa.ruhr-uni-bochum.de/statmetepi/2006/Vortrag_Larsen.pdf
We have previously stored the State names in short form (one for each State, not one
for each cell) in c299, so we can view the State name, the residual (the difference from
the national rate on the loge scale), the 1.96 standardised residual, and the rank.
We can plot the residuals and the 95% confidence intervals against the rank
Clearly, 5 states are significantly above the national rate represented by the value 0.
We can turn the residuals into relative risks by exponentiating them, and also calculate the
upper and lower confidence intervals on this scale. First close the graph display, then in
the command interface window:

calc c302 = expo(c300 - c301) - (expo(c300))    the differential lower 95% RR
calc c301 = expo(c300 + c301) - (expo(c300))    the differential upper 95% RR
calc c300 = expo(c300)                          the relative risk
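The same transformation can be sketched outside MLwiN; here res and half are hypothetical stand-ins for the contents of c300 (log-scale residuals) and c301 (1.96 × SE half-widths):

# A sketch: turn log-scale residuals and their 1.96*SE half-widths into
# relative risks with differential (asymmetric) error bars for plotting.
import numpy as np
res = np.array([1.756, -0.916])       # hypothetical log-scale State residuals
half = np.array([0.48, 0.62])         # hypothetical 1.96*SE half-widths
rr = np.exp(res)                      # the relative risk
lower_bar = np.exp(res - half) - rr   # differential lower 95% limit (negative)
upper_bar = np.exp(res + half) - rr   # differential upper 95% limit (positive)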
Then change the customised graphics in D10, so that Stateshort is given as the group:
 The lower error bars for the 95% CIs are in c302
 Tick the group code on in the Other sub-window
 In relation to labels, use text labelling with the ‘on the graph’ method, Done, Apply
To get the actual values used in this graph, in the command interface:

Copy 3 c299 c300 c301 c302

The 3 in this command results in the copying of textual labels where
appropriate. After sorting we get the following results.
State   Relative risk   Upper 95% CI   Lower 95% CI   Rank
Mani    5.79            9.35           3.59           28
AndP    3.83            6.16           2.38           27
Maha    3.63            5.81           2.27           26
Karn    3.23            5.31           1.97           25
Mizo    3.14            8.12           1.21           24
Goa     1.71            4.51           0.65           23
Madh    1.52            3.39           0.68           22
Tami    1.44            2.48           0.83           21
HimP    1.25            3.85           0.41           20
Jam&    1.21            3.71           0.40           19
Assa    1.07            3.20           0.36           18
Punj    1.01            3.01           0.34           17
Oris    0.91            2.68           0.31           16
WBen    0.86            2.27           0.32           15
Guju    0.83            2.71           0.25           14
Delh    0.73            2.79           0.19           13
Jkar    0.68            2.54           0.18           12
Hary    0.66            2.45           0.18           11
Megh    0.62            2.97           0.13           10
Chat    0.58            2.08           0.16           9
Raja    0.58            2.07           0.16           8
Trip    0.51            2.25           0.12           7
AruP    0.50            2.21           0.12           6
Sikk    0.49            2.10           0.11           5
UttP    0.46            0.85           0.25           4
Utta    0.45            1.89           0.11           3
Biha    0.43            1.75           0.10           2
Kera    0.40            1.61           0.10           1
As anticipated, there are high relative risks for Manipur, Andhra Pradesh, Maharashtra, and
Karnataka, and a confirmed low risk in Uttar Pradesh; but the risk for Tamil Nadu is not
particularly high, while that of Mizoram is rather unexpectedly elevated, being among the very
highest (the large CI is due to a relatively small sample of 647 individuals; this was not a
State that was heavily sampled). It might be useful to have a look at the raw counts and the
simple un-modelled risk; this requires a little work, as the data are organized by
cell and not by State. In the command interface window:
MLSUm 'States' 'Cases' c31               Multilevel sum over cells by States for cases
MLSUm 'States' 'Cases+NonCases' c32      Multilevel sum over cells by States for respondents
TAKE 'States' c31 c32 c31 c32            Take (ie unreplicate) the first entry for each State and put back into c31 c32
calc c33 = c31/c32                       Calculate the rate
Name c31 'Sero' c32 'Samples' c33 'Rates'
copy 3 c299 c31 c32 c33                  Copy the State name and values
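For readers working outside MLwiN, the same aggregation can be sketched with pandas (hypothetical cell counts for illustration, not the survey data):

# A sketch of the MLSUm/TAKE steps with pandas: sum cases and respondents
# over the cells of each State, then compute the raw rate.
import pandas as pd
cells = pd.DataFrame({                       # hypothetical cell-level rows
    "State":   ["AndP", "AndP", "Mizo", "Mizo"],
    "Cases":   [60, 41, 4, 2],
    "AllResp": [7000, 5928, 400, 247],       # Cases + NonCases
})
by_state = cells.groupby("State")[["Cases", "AllResp"]].sum()
by_state["Rate"] = by_state["Cases"] / by_state["AllResp"]
print(by_state)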
Stateshort   Sero   Samples   Rates
AndP         101    12928     0.0078125
AruP         0      653       0
Assa         3      1163      0.0025795356
Biha         0      1062      0
Chat         1      1198      0.00083472452
Delh         1      892       0.0011210763
Goa          5      1077      0.0046425257
Guju         2      1132      0.0017667845
Hary         1      961       0.0010405828
HimP         3      963       0.0031152647
Jam&         3      1053      0.0028490028
Jkar         1      903       0.0011074197
Karn         66     9608      0.0068692756
Kera         0      1212      0
Madh         8      2338      0.003421728
Maha         113    15523     0.0072795209
Mani         90     7991      0.01126267
Megh         0      399       0
Mizo         6      647       0.00927357
Oris         3      1425      0.0021052631
Punj         3      1319      0.0022744504
Raja         1      1239      0.00080710248
Sikk         0      721       0
Tami         34     11003     0.0030900664
Trip         0      664       0
Utta         0      938       0
UttP         18     21166     0.0008504205
WBen         4      2170      0.001843318
The elevated modelled risk for Mizoram is based on only 6 cases out of a sample of 647
respondents, but should clearly be investigated further.
Comparing alternative estimates of Model 5: a two-level model for States
It is interesting at this point to see the degree of over-dispersion in the Poisson model,
then have a look at the NBD model, and finally use MCMC methods to assess the robustness
of the estimates.
 In the Equations window
 Click on Nonlinear
 Tick on extra Poisson
 Done
 More iterations to convergence
 In the Equations window, Store the model and label as Extra

The results are as follows:
There is some residual under-dispersion at the cell level. This may be a result of not
taking account of the clustering of cells within enumeration areas within States, which
reflects the multistage design of the Health Survey. We can test whether this under-dispersion
parameter is different from the null value of 1 (specified in the Intervals and tests window by the
constant(k)), and we find that the under-dispersion is not significant at conventional levels:

CPRO 1.337 1
0.24756
Proceeding to the NBD model:
 In the Equations window
 Click on the Nonlinear window
 Tick off extra Poisson; Done
 In the Equations window, click on Poisson; choose –ve Binomial (we still have the offset)
 More iterations to convergence
 In the Equations window, Store the model and label as NBD
 In the command interface, type mcomp "NBD" to display and copy the estimates
Examining the stored model gives the following estimates:

                      Model NBD   Standard Error
Fixed Part
Cons                  -3.012      0.508
Mal                    0.399      0.552
25-34                  1.026      0.227
34-44                  0.578      0.251
45-54                  0.221      0.367
Mal.25-34              0.817      0.344
Mal.34-44              1.227      0.363
Mal.45-54              0.917      0.473
Low                    1.440      0.435
Low.Mal               -0.769      0.489
Random Part
States: Cons/Cons      0.870      0.347
bcons.1/bcons.1        1.000      0.000
bcons2.1/bcons2.1      0.185      0.071

The over-dispersion parameter, labelled bcons2.1/bcons2.1, at 0.185 is not very different from the
null value of zero, given the standard error of 0.071. A Poisson model would seem appropriate for these data.
Finally, in this comparison of the estimation of the two-level model, we will now use
MCMC methods to fit a model with an assumed Poisson distribution at the cell level:
 In the Equations window, click on –ve Binomial
 Choose Poisson instead [MCMC is not enabled for NBD models]
 More iterations to convergence, as the IGLS results are required as starting values for the MCMC procedure
 Click on Estimation control
 Click on the MCMC Estimation tab
 Increase the chain monitoring length to 50k, as Poisson models tend to have correlated chains
 Change thinning to 10, so that only 1 in 10 simulations is stored [calculations are done on the complete chain]
 Done
 Start
 After the 50k has been completed, in the Equations window Store the model and label as MCMC

The program uses the IGLS, PQL, 2nd order estimates as starting values, burns in for 500
simulations which are discarded, followed by 50k monitoring simulations. On completion
the estimates are as follows (they are still blue because MCMC has stochastic
convergence to a distribution, unlike IGLS, which has deterministic convergence to
point estimates).
As the focus of interest is the State-level variance, we need to check whether we have run a
sufficient length of chain before comparing the results from different models.
 Model on main menu
 Trajectories
 Click on Select at the bottom of the Trajectories window
 Choose States: Cons/Cons, that is, the higher-level variance
 Change the structured graph layout to 1 graph per row to see the plot in detail
 Done

To get a detailed plot of the simulations for this parameter:
Click in this graph to get a summary of the simulated values, the MCMC diagnostics.

This plot shows a pretty healthy trace (top left) as the MCMC sampler explores the
parameter space; the estimates are not too highly auto-correlated (middle left) and the
Effective Sample Size is, very respectably, in excess of 2000. All this suggests that we have run
the chain sufficiently. The smoothed histogram (top right) shows a marked positive skew in
the degree of support for the parameter. The posterior mean at 1.105 is larger than the
mode at 0.905 (reflecting the skew); the 95% credible intervals (0.461 to 2.380) do not
include 0, so we have strong evidence that there are genuine differences between the
States in terms of the relative risk of being sero-positive.
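For readers curious about the arithmetic behind the Effective Sample Size, here is a rough Python sketch of one common estimator (MLwiN's exact implementation may differ):

# A rough sketch of an Effective Sample Size calculation:
# ESS = N / (1 + 2 * sum of positive lag autocorrelations).
import numpy as np
def effective_sample_size(chain, max_lag=200):
    chain = np.asarray(chain, dtype=float)
    n = len(chain)
    centred = chain - chain.mean()
    denom = centred @ centred
    rho_sum = 0.0
    for lag in range(1, min(max_lag, n)):
        rho = (centred[:-lag] @ centred[lag:]) / denom
        if rho <= 0:          # truncate once the autocorrelation dies away
            break
        rho_sum += rho
    return n / (1 + 2 * rho_sum)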
We can also see what a poorly estimated parameter looks like:
 Model on main menu
 Trajectories
 Click on Select at the bottom of the Trajectories window
 Choose Fixed: Male, that is, the male differential for <24 years
 Change the structured graph layout to 1 graph per row to see the plot in detail
 Done
To get a detailed plot of the simulations for this parameter, click in this graph to get a
summary of the simulated values, the MCMC diagnostics.

This plot shows an unhealthy trace (top left); there is evidence of ‘slow drift’, as if the
sampler is getting stuck in one part of the parameter space rather than rapidly exploring it;
successive estimates are highly auto-correlated (middle left) and the Effective Sample Size is
only 58 even after 50k simulations. The smoothed histogram (top right) shows a distribution
that includes zero, with the 95% credible intervals ranging from -0.554 to 1.616. If this were a
crucial parameter we might want to run the sampler for a large number of further iterations;
but it simply looks as if there is little evidence that this parameter is sizeable: a
difference between Male and Female for the under-24s receives very little
empirical support. We will briefly come back to more efficient samplers at the end of this
chapter.
We now compare the estimates from the different procedures:
 Model on main menu
 Compare stored models
 (or, just to get the 2-level models, mcomp 2 3 4 5 in the Command interface)

Copy places the values on the clipboard so that they can be pasted into a word-processor to form a table.
There has not been a great deal of change in the fixed part estimates and the standard
errors using PQL 2nd order (Poisson, extra-Poisson, NBD) and MCMC (Poisson). The higher-level
random-part estimate for the between-State variance is also very compatible across
the three PQL 2nd order estimates, reflecting the insignificance of the extra-Poisson term.
The mean of the MCMC estimates at 1.105 is somewhat higher, reflecting the positive skew
of the support distribution, but so is the computed standard error. Given that
there are only 28 higher-level units, it would be sensible to report the MCMC results, while
pointing out that PQL 2nd order modelling had found no evidence of over- or under-dispersion.
You can obtain more summary detail about the MCMC estimates by ticking
‘extended MCMC information’ in the Manage stored models window.
Question 5
Using the MCMC results, what are the State differential risks on the loge scale and as
relative risks? Do they change a great deal from the Poisson model estimated by 2nd order
PQL?
Model 6: between State differences for Urban-Rural
The final model to be fitted estimates the differences in relative risk between Urban and
Rural areas, and allows these differences to vary over States. Urban/rural status is really a
higher-level variable measured at the enumeration district, but this is not available to us
here, so we are in effect treating it (somewhat inappropriately) as an individual cell-level
effect. We will use the MCMC procedure, but we have to begin with an IGLS model to get
starting values. We will also use a specification that gives the variances directly (rather than
as a base and a contrasted dummy) in the level-2 random part, but contrast Urban with
Rural areas in the fixed part of the model.8 This specification allows us to calculate and plot
the correlation between the relative risk for urban and rural areas in each State.
We have first to remove the random intercepts associated with the constant at level 2:
 In the main menu, click on Estimation, and change back to IGLS
 Equations window
 Click on Cons and remove the tick for j(States); this will remove the level 2 variance term for the State random intercepts; Done
 Add term; choose the variable to be Urbanity and the reference category to be None (scroll up to get this); Done; this will create two separate dummies and place both in the model with the names Rural and Urban
 Click on Rural, tick off Fixed parameter (Urban will now be contrasted with Rural in the fixed part), and tick on j(States), so that a new random term will be included at the higher level; Done
 Click on Urban, leave the tick for Fixed parameter, and tick on j(States), so that a new random term will be included at the higher level; Done
 Start iterations to convergence (it has a problem getting started, but it does get there)
From this we can see that Urban-based individuals have a significantly higher risk on the
loge scale (0.597 is more than twice 0.198) but rural-based individuals have the greater
between-state variance. Now estimate this model with MCMC
 Click on Estimation control
 Click on the Estimation tab

8 The major drawback of this separate specification approach to random effects is that developing the model
does not have a parsimonious route. If we wanted to include differential random effects for States for gender
and urbanity, we would need to estimate variances and covariances for 4 random effects (2 genders times 2
urbanity groups).
 Increase the chain monitoring length to 50k, as Poisson models tend to have correlated chains
 Change thinning to 10, so that only 1 in 10 simulations is stored
 Done
 Start
 After the 50k has been completed, the results look like the following; store the model as UrbRur
Again the MCMC estimated random terms are larger than the IGLS ones, but also have a
larger calculated standard error. Comparing the two MCMC models
mcom "MCMC" "UrbRur"
                     Model MCMC   Standard     Model UrbRur   Standard
                                  Error                       Error
Fixed Part
Cons                 -3.237       0.557        -3.612         0.593
Mal                   0.522       0.548         0.516         0.570
25-34                 1.086       0.193         1.046         0.193
34-44                 0.633       0.223         0.591         0.222
45-54                 0.232       0.347         0.171         0.355
Mal.25-34             0.756       0.299         0.791         0.296
Mal.34-44             1.244       0.327         1.281         0.316
Mal.45-54             0.922       0.436         0.976         0.444
Low                   1.545       0.450         1.586         0.475
Low.Mal              -0.858       0.498        -0.886         0.522
Urban                 -           -             0.636         0.294
Random Part
States: Cons/Cons     1.105       0.512         -             -
States: Rural/Rural   -           -             1.771         0.872
States: Urban/Rural   -           -             1.080         0.522
States: Urban/Urban   -           -             0.867         0.421
Cell: Cons/Cons       1.000       0.000         1.000         0.000
DIC:                  1204.876                  1197.336
We see that the DIC has decreased even though there has been an increase of 3 nominal
parameters (one fixed part term and 2 additional variance-covariance terms); the new model is a
better fit to the data.
We will start the interpretation with the fixed part term for Urban; here is the
monitored chain for 50k
And for 100k
And for 200k
Although, as would be expected, the Effective Sample Size increases substantially, there is
no substantive change in the parameter estimates, in terms of the mean, mode, standard
deviation and quantiles. There is some support for the loge rate being higher in urban than in
rural areas, but the evidence is not overwhelming (and that is without taking account of the fact that
the standard error is likely to be underestimated due to the multistage design). In the
customised predictions window we can estimate the median log rate for urban living as a
difference from rural.

Using the command interface we can exponentiate to get the relative risk, or incidence
rate ratio, of Urban living compared with Rural:

expo c23 c30 c40 c23 c30 c40     (check the columns for the median, low and high)

and plot the results; by using the median we are using the cluster-specific estimates and
not the population average ones.
Clearly the relative risk is higher for Urban living (the point estimate for the relative risk
is 88% higher than for rural living), but there are wide confidence intervals.
Turning now to the State differentials, we can examine the pair of residuals:
 Model on main menu
 Residuals
 Request the level 2 State residuals and the 1.96 SD (comparative) residual
We then plot each pair as ‘caterpillar’ plots, putting the State names on both graphs (in the
Plot what? tab, Group is Stateshort; in the Other tab, tick on group code).
The differences are more easily seen on a pairwise plot of the residuals that graphs the
State differentials against each other (again using Stateshort as the group code).
From examining the axes, it is clear that the differentials in the relative risks are much
greater for rural as opposed to urban living. The patterns are similar for both types of
living, with something of an exception for Tamil Nadu, in which rural living has a
comparatively higher risk. Using the Estimate tables, we can see that the estimated
correlation between the two differentials on the log scale is 0.87, so the patterns
are very similar.
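That correlation comes straight from the level-2 variance-covariance estimates; as a quick check (a sketch using the UrbRur estimates in the comparison table above):

# A sketch: the correlation between the Rural and Urban State effects
# from the estimated level-2 variances and covariance.
import math
var_rural, var_urban, cov_ru = 1.771, 0.867, 1.080
corr = cov_ru / math.sqrt(var_rural * var_urban)
print(round(corr, 2))   # about 0.87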
Finally we can exponentiate the residuals to get the relative risks and update the plot.
These risks are for different States relative to living in urban and rural areas nationally.
Question 7
You could now fit models that include additional fixed part terms to estimate interactions
between gender and urbanity, and between age and urbanity. It is also possible to investigate
other between-State differences in relation to gender and age-group.
More efficient MCMC samplers
In the Urban-Rural model the fixed part Urban differential with a chain length of 50k was
found to be
Given the ESS of only 201, this is a very inefficient sampler. To access the new sampling
schemes
 Model on Main menu
- 78 -
MCMC
MCMC options
Click on Use hierarchical centring at the highest level. here 2
Click on Use orthogonal parameterisation
Done
Start
Using the Trajectories window, re-examine the monitored chain for the fixed part differential
for Urban. In effect the efficiency of the sampler has been doubled by combining the two
procedures. Further details are given in the 2009 MCMC manual. Our advice is that you
routinely use these options.
Some answers
Question 1
Use the command interface to calculate the rate per 10,000
calc b4 = b3 *10000
45.629
In sampling 10,000 individuals we could anticipate only finding some 45 people with HIV.
Question 2
What would the results be if Males had been chosen as the reference category and not
Females?
The estimated log value is now minus 0.445 for Females, so that the log rate for females
is lower than that for males. The customised predictions need to be compared to males.
The results are then exponentiated:

expo c19-c21 c19-c21

to get the IRR and the 95% CIs:

Gender.pred   Cons.pred   median.pred   median.low.pred   median.high.pred
Fem           1           0.64245391    0.53462911        0.76497763
Mal           1           1             1                 1
and a revised plot made with appropriate re-labeling and re-scaling
When the IRR for males is set to 1, that for women is 0.64, a significant and sizeable
reduction. Before proceeding, re-specify the model so that Female is the base.
Question 3
Make a plot that contrasts the SMR at different ages for men and women separately. This
can be achieved by plotting the median predictions against Gender on the X axis and
grouping by Age
The patterns are the same for both sexes; the consistently higher rates are found for
men in each age group.
Question 4
What type of person is the constant in this model?
The type of person for whom all the other variables are zero, that is, a Female aged under 24
years.
Question 5
Using the MCMC results, what are the State differential risks on the loge scale and as relative
risks? Do they change a great deal from the Poisson model estimated by 2nd order PQL?
State   RR-MCMC   Upper 95% CI   Lower 95% CI   Rank
Mani    6.32      10.85          3.68           28
AndP    4.18      7.15           2.44           27
Maha    3.95      6.73           2.32           26
Karn    3.52      6.15           2.02           25
Mizo    3.52      9.43           1.31           24
Goa     1.89      5.01           0.71           23
Madh    1.66      3.78           0.73           22
Tami    1.56      2.82           0.87           21
HimP    1.35      4.14           0.44           20
Jam&    1.31      4.02           0.43           19
Assa    1.14      3.38           0.39           18
Punj    1.07      3.18           0.36           17
Oris    0.97      2.85           0.33           16
WBen    0.90      2.38           0.34           15
Guju    0.86      2.81           0.26           14
Delh    0.71      2.84           0.18           13
Jkar    0.66      2.55           0.17           12
Hary    0.65      2.45           0.17           11
Chat    0.57      2.10           0.15           10
Megh    0.57      3.06           0.11           9
Raja    0.56      2.09           0.15           8
UttP    0.49      0.93           0.26           7
AruP    0.46      2.30           0.09           6
Trip    0.45      2.29           0.09           5
Sikk    0.43      2.18           0.09           4
Utta    0.39      1.91           0.08           3
Biha    0.38      1.79           0.08           2
Kera    0.35      1.64           0.07           1
Both estimates are effectively the same, with the MCMC ones being very slightly larger in
absolute magnitude.
Question 6
Have all three higher-level variances had sufficiently long monitored chains?
All three estimates have an Effective Sample Size close to or substantially over 500; this
should be sufficient to characterize their distributions.
14. Longitudinal analysis of repeated measures data
Introduction
Recent decades have seen remarkable advances in methods for analysing longitudinal data.
Classical time series analysis has a long and rich history and was originally developed for
lengthy and detailed series, such as the annual temperature record for central England over
some hundreds of years, or the value of stocks and shares on a daily basis. This approach
was typically developed for a single object, hence the jibe that it is really the analysis of a
sample of one! In contrast, we are dealing with multiple time-based measures for multiple
subjects. Some paradigmatic cases are labour economics, where one might observe 30 years of
annual income for thousands of individuals (large N panels); biometric growth studies,
where some hundreds of children may be measured on 5 occasions; and comparative
political economy, where there may be thirty years of observations for twenty-five countries
(small N panels or temporally dominated data). In the social sciences such data are known as panel
data, in medicine as repeated measures, and in economics and political
economy as cross-sectional, time-series data. The distinctive features are that data on
repeated occasions are naturally clustered within individuals, the ordering of the data
means something so that observations can be later or earlier in time, and there is usually
dependence, so that later observations are related to earlier ones. While classical time series
was concerned to model variation over time, and particular care was taken to model
dependence over time (serial correlation) faithfully, the models that we are going to
examine aim to account for this variation and for how it varies between subjects, be they people
or countries. With repeated measures on individuals we can capture within-individual
change, that is, longitudinal effects, and separate them from differences among people in
their baseline values – cross-sectional differences. Both may be of substantive interest: the
outcome may respond rapidly to changing values for individuals, but there could also be
outcomes where the response is more closely related to the enduring nature of the
subjects.
The research problem
We have deliberately chosen a rather straightforward main example in which children are
repeatedly measured in terms of aerobic performance over a period to assess and
understand their physical development. It is straightforward in that the outcome can be
treated as a Normal distribution and that the research problem has a strict hierarchical
structure in that repeated measures are seen as nested within children who are nested
within schools. Aerobic performance is measured by a 12 minute runs test - the distance
someone can run or walk in 12 minutes. The test was developed as an easy-applied method
for assessing cardiovascular endurance. The measure is an important one in its own right for
children and adolescents, and additionally it is known to be a key determinant of adult
health. Ruiz et al (2009) reviewed seventeen high-quality studies and concluded that there
was strong evidence that higher levels of cardiorespiratory fitness in childhood and
adolescence are associated with a healthier cardiovascular profile later in life in terms of
such risk factors as abnormal blood lipids, high blood pressure and overall and central
adiposity.9 This study aims to model aerobic performance as children age and develop, and
to consider which demographic characteristics are associated with this change.
The original study was undertaken in Madeira at a time of rapid and substantial social and
economic change.10 The data that we are using are realistic in that they are based on the
Madeira Growth Study but we have simplified the data somewhat as that study is based on
an accelerated growth design in which repeated measures are taken of multiple overlapping
cohorts. We consider the appropriate analysis for such data with multiple cohorts in the
next chapter.
Throughout we will be referring to individual or subject-specific change, as our
example is based on development in children, but there is no reason why the subject cannot be a
firm or indeed a country. While the example we are using is about development and growth,
the same procedures are used to model decline, the economist’s negative growth.
Three types of longitudinal data analysis
There are basically three types of approach to modelling change.
Marginal approaches focus exclusively on getting the general trend (for
everybody or for subgroups), as the alternative name of population-average models makes
clear. With such models the nature of the dependency over time is not seen as of
substantive interest and is treated as a nuisance that gets in the way of good estimates of the
trend. Consequently, individual differences are averaged over and not explicitly modelled.
One method of estimation is to ignore totally the structure of occasions nested within
individuals, just fit means, and treat the observations as if the data were independent over
time; this can be achieved with standard OLS regression. The trouble with this is that the
standard errors are affected by the dependency, and while the estimates are consistent
(approaching the true population parameters as the sample size increases), they are not
efficient (they are not the estimates with the smallest possible sampling variance).
Consequently, a more efficient procedure known as Generalized Estimating Equations (GEE)
is often deployed, in which a ‘working correlation matrix’ is defined for the dependency and
9 Ruiz, J R, Castro-Piñero, J, Artero, E G, Ortega, F B, Sjöström, M, Suni, J, Castillo, M J (2009) Predictive validity of health-related fitness in youth: a systematic review, Br J Sports Med, 43, 909-923.
10 Freitas, D L, Gaston Beunen, José A Maia, Johan Lefevre, Albrecht L Claessens, Martine Thomis, Élvio R Gouveia, Jones, K (2011) Body fatness, maturity, environmental factors and aerobic performance: a multilevel analysis of children and adolescents, mimeo.
is used iteratively in the estimation process.11 The advantage of this procedure is that it is
relaxed about distributional assumptions; it does not require an explicit specification of a
statistical model. The disadvantages are that the procedure is not readily extendible to
more complex models, there is no way within the standard GEE procedure to make formal
inferences about the dependency structure over time, and you cannot determine how
between-subject heterogeneity may be changing.
Subject-specific models are a flexible approach that can capture, in a reasonably
parsimonious way, the complexity of growth processes as they occur in individuals. The focus is
on modelling individual growth trajectories, with explicit terms for both the average
trajectory and how individuals depart from it. Specification comes in two forms: in
addition to the fixed terms for the overall average trajectory, there can be a fixed term for
each individual, or random terms for them. In the latter formulation it is also possible to
include subject-level variables to try to account for the departures from the general
trajectory. This random-effects formulation is the multilevel or mixed longitudinal model.
These random effects not only capture individual differences in growth trajectories but also
simultaneously induce residual dependence within individuals over occasions, which is
thereby modelled. This is the reason why this model is also known as the conditional model:
we estimate the fixed part averages while taking account of the random effects. In the
Normal-theory model the marginal and conditional models will generally give very similar
results, but this is not generally the case when discrete data are analysed; the difference is
greatest when there are large differences between individuals. It is possible (and MLwiN has
facilities for this) to derive the marginal from the conditional but not vice-versa.
Autoregressive or lagged-response models are those in which the outcome depends on the response
on a previous occasion, so that the same values (albeit lagged) can appear on both sides of
the equation. These more complex models, known as transition models in the bio-statistical
literature and as dynamic or state-dependence models in econometrics, should in general only
be used if dependence on previous responses is of substantive interest or, to put it another
way, if lagged effects have the possibility of a causal interpretation. An example might
be that the number of infected people (response at t) in an area this week might depend on
the number infected last week (lagged predictor at t-1). A very approachable applied
account of these models is given in Schmid (2001).12
The sole focus here is on the subject-specific approach, where there is an explicit
model with random terms for each person. Before proceeding any further it is helpful to get
a feel for the subject-specific approach that will be used. Figure 1 shows the three
11 For a hand-worked example see Hanley, J A, Negassa, A, Edwardes, M D, Forrester, J E (2003) Statistical analysis of correlated data using generalized estimating equations: an orientation, Am J Epidemiol, 157, 364-375.
12 Schmid, C H (2001) Marginal and dynamic regression models for longitudinal data, Statistics in Medicine, 20, 3295-3311.
components of the two-level model (occasions nested within individuals) in a schematic
way. The response could be a sum score for 13 questions on happiness, measured
weekly over a two-year period. The three components are given in Table 1, with Crowder and
Hand’s (1990) rather colourful descriptors.13
Table 1 Three components of the two-level growth model

Term in the model                What is shown                                       Crowder and Hand
Fixed part with linear trend     A generally increasing secular trend of             ‘immutable constant
                                 happiness, a rising population average              of the universe’
Random effects for individuals   Person 2 is consistently above the trend;           ‘the lasting
                                 Person 1 is consistently below; the                 characteristic of the
                                 heterogeneity between individuals (the gap          individuals’
                                 between them) is increasing with time.
                                 Needs random slopes for the latter.
Random departure for occasion    Both individuals have their good and bad            ‘the fleeting
                                 days, which seem ‘patternless’; there is no         aberration of the
                                 evidence of either individual becoming more         moment’
                                 ‘volatile’ with time, and one day is unrelated
                                 to the next
Figure 1 Three components of the subject specific model
13 Crowder, M J and Hand, D J (1990) Analysis of repeated measures, Chapman and Hall, London.
What is meant by ‘time’?
In every subject-specific model for change, some metric of time has to be included in the
model as a predictor and used to define occasion. The specific choice of how time is
measured and specified depends largely on substantive research questions and commonly
there are several plausible metrics. In the initial endurance study we have used two forms
of time – the chronological age of the children at the planned time of measurement in
decimal years (a continuous measure which potentially gives greater precision as it
contains more information) and the year in which the measurement was made (five discrete
periods). Other possibilities are age in months since the inception of the study, the birth-year
cohort into which they were born, or indeed the height in centimetres of the children
when endurance was measured. The measure of time has to make sense in the context of
the study. Detailed psychological or physiological response studies generated by digital data
loggers may operate in minutes and seconds while generational change may operate over
decades. The one technical requirement is monotonicity – the measure cannot reverse
direction – so that a variable like height is a possible measure of developmental time (you
cannot become shorter), but weight is problematic, as you could become lighter as time
unfolds. A specific analysis, as we shall see, may use more than one measure of time in
different parts of the model, so that discrete time in years may be used to define occasion,
while continuous age is used in the fixed and random parts to model individual
developmental trajectories. It is also worth stressing that the dependent variable has to be
measured reliably and validly over time and must not change its meaning – its construct
validity – with time. While this is likely to be the case with hard measures like aerobic
performance, it is much more demanding for outcomes such as skills, attitudes and
behavioural measurements.
What this chapter covers
The major sections of this chapter are as follows.
 A general conceptual overview of the multilevel approach to modelling repeated measures data. This overview aims to be brief but comprehensive and makes the argument in terms of small illustrative examples and graphs.
 We consider hierarchical subject-specific models and provide an algebraic specification of the two and three level model using the random-effects approach. This is followed by a more detailed consideration of alternative forms of dependence in terms of alternative covariance patterns.
 The models are then applied using MLwiN to the development of aerobic performance during childhood and adolescence; we consider some determinants of development and alternative covariance structures to model the dependency over time.
 It is shown how the marginal model estimates differ from the conditional estimates when the response is discrete, how both can be obtained in MLwiN, and how they differ in their interpretation.
 The chapter concludes with a consideration of the fixed versus random effects approach to longitudinal data in the context of robustness to endogeneity. This section also considers the true role of the commonly misused Hausman test.
Given that researchers may wish just to use such growth models we have tried to make the
chapter relatively self-contained, but inevitably we use material that has been covered in
previous chapters.
What this chapter is not about (and where to look for help)
It is worth stressing that this chapter, despite being long, concentrates on the basics. Other
possibilities for the analysis of measurements over time include the following.
 If you only have two time points, we recommend simply using the standard model, specifying the dependent variable as the variable measured on the later occasion and the measurement on the first occasion as another predictor; you will then be modelling change. This is a more flexible procedure than subtracting the past from the current value and modelling a change score. This is the approach used in the early chapters of the MLwiN User Manual, in which educational progress is modelled by regressing Normexam on Standlrt.14
 If you have repeated cohorts, see Chapter 11 of Volume 1, which considers the changing performance of London schools as different cohorts of pupils pass through the schools. This is the analysis when there are just two measurements on individuals, on entry to and departure from the school. In the next chapter we will illustrate what to do when there are multiple repeated measurements on multiple cohorts.
 In reality children can be expected to change schools over time, necessitating a non-hierarchical approach as children can be seen as ‘belonging’ to more than one school. This development will not be considered here, but the MLwiN MCMC manual (Chapter 16) considers such multiple membership models, in which weights can be included for the amount of time a child spent in a particular school.15 For such models MCMC estimation is essential.
 It is possible to analyse more than one outcome simultaneously; thus Goldstein (2011) considers two responses for children: height as they are growing and full adult height.16 An example with two discrete outcomes would be one response of whether the patient got better over time and, simultaneously, whether they were affected by side effects of the treatment. This can be achieved using a multivariate multilevel model, which is considered in the MLwiN User Manual (Chapter 14) and even more flexibly in the Realcom software, which can handle a variety of different types of response (continuous and discrete) simultaneously.17 To avoid confusion, this multivariate approach can also be used to model a single outcome with multiple occasions, and we shall use it here to model complex dependency over time in a very flexible (but non-parsimonious) way.

14 Rasbash, J., Steele, F., Browne, W.J. and Goldstein, H. (2009) A User’s Guide to MLwiN, v2.1, Centre for Multilevel Modelling, University of Bristol.
15 Browne, W.J. (2009) MCMC Estimation in MLwiN, v2.1, Centre for Multilevel Modelling, University of Bristol.
16 Goldstein, H (2011) Multilevel statistical models, 4th Edition, Wiley, Chichester.
 While we shall be considering temporal dependence in some detail, we will not combine it with spatial dependence; such modelling is possible by using cross-classified and multiple membership models; see Chapter 17 of the MLwiN MCMC Manual, and Lawson et al (2003).18
 We must also stress that we are not considering the multilevel approach to event history analysis, where attention focuses on the time to an event, be it single (death) or multiple states (single/cohabiting/married/divorced). Materials on that topic are available from the Centre for Multilevel Modelling.19
 Another topic that is not covered (and is not available in MLwiN) is group trajectory modelling, in which the aim is to identify groups of children that have similar growth trajectories (Nagin, 2005).20 This is achieved by assuming that the random effects for children at the higher level follow a discrete distribution instead of the usual Normal distribution. A linked multinomial model can then be used to try to account for group membership. This approach has been implemented in SAS21 and could be estimated in MPlus, LatentGold and GLLAMM, for these programs have facilities for non-parametric estimation of latent effects, the random child differences. The approach has been used not just on individuals but on countries and areas to ascertain distinctive trajectories of life expectancy and voting outcomes.22
 In all the models that follow, the quality of the estimates requires that the missing observations do not represent informative dropout. That is, the standard application of the random-coefficient approach requires Missingness at Random (Rubin, 1976) if the estimates are to be unbiased.23 That is, conditional on the predictors in the fixed part of the model, there should be no patterning to the missingness. A more
17 Information on Realcom is available from http://www.cmm.bristol.ac.uk/research/Realcom/index.shtml
18 Lawson, A B, Browne, W J and Vidal Rodeiro, C L (2003) Disease mapping with WinBUGS and MLwiN, Wiley & Sons, New York.
19 http://www.bristol.ac.uk/cmm/software/support/workshops/materials/eha.html
20 Nagin, D. S. (2005) Group-based Modeling of Development, Harvard University Press, Cambridge, MA.
21 Jones, B., Nagin, D. S., & Roeder, K. (2001) A SAS procedure based on mixture models for estimating developmental trajectories, Sociological Research and Methods, 29, 374-393.
22 Jen, M-H, Johnston, R, Jones, K et al (2010) International variations in life expectancy: a spatio-temporal analysis, Tijdschrift voor Economische en Sociale Geografie, 101, 73-90; Johnston, R J, Jones, K & Jen, M. (2009) Regional variations in voting at British general elections, 1950-2001: group-based latent trajectory analysis, Environment and Planning A, 41, 598-616.
23 Rubin, D B (1976) Inference and missing data, Biometrika, 63, 581-592.
sophisticated approach involving imputation is also available and has been
implemented in MLwiN; for a tutorial see Goldstein, H (2009).24
 The chapter concentrates on Normal-theory outcome models, but it is possible to use MLwiN to analyse counts (in Poisson and Negative Binomial models) and discrete outcomes (in binomial and multinomial models). A key issue is that, unlike in the Normal-theory case, the results for the fixed part of the random-effects model do not give the population average value directly, although it can be obtained. At the end of this chapter we discuss this issue further, and in the next chapter we show some examples of these models in the analysis of changing gender ideologies in the UK and happiness in the USA. With these models we will also consider how it is possible to analyse age, period and cohort effects simultaneously.
A conceptual overview of the random-effects, subject-specific approach
A great deal has been written in recent years on the random-effects approach to
longitudinal data analysis since the foundational statement of Laird and Ware (1982).25 In
addition to the treatment of this approach in specific chapters of the standard multilevel
books (Raudenbush and Bryk, 2002, Chapter 6; Goldstein, 2011, Chapter 5; Hox, 2011,
Chapter 5; Snijders and Bosker, 1999, Chapter 12)26, there are a number of excellent book-length,
introductory-to-intermediate treatments largely dedicated to the random-effects
approach. These include Singer and Willet (2003), Fitzmaurice et al (2004), and Hedeker
and Gibbons (2006).27 More advanced book-length treatments include Verbeke and
Molenberghs (2000, 2006; the latter concentrates on discrete outcomes) and Fitzmaurice et
al (2009), a handbook that covers the field more broadly.28 Focussed, highly recommended
expository articles that pay particular attention to practical issues include Singer (1998),
Maas and Snijders (2003), Cheng et al (2010), and Goldstein and de Stavola (2010).29
24 Goldstein, H (2009) Handling attrition and non-response in longitudinal data, Longitudinal and Life Course Studies, 1, 63-72. Excellent practical advice on missing data can be found at http://www.lshtm.ac.uk/msu/missingdata/index.html
25 Laird, N M and Ware, J H (1982) Random-effects models for longitudinal data, Biometrics, 38, 963-974.
26 Hox, J (2011) Multilevel analysis: techniques and applications, Second Edition, Routledge, New York; Raudenbush, S and Bryk, A (2002) Hierarchical linear models: applications and data analysis methods, 2nd ed., Sage Publications, Newbury Park, CA; Snijders, T and Bosker, R (1999) Multilevel analysis: an introduction to basic and advanced multilevel modeling, Sage Publications, London.
27 Singer, J D and Willet, J B (2003) Applied longitudinal data analysis: modeling change and event occurrence, Oxford University Press, New York; Fitzmaurice, G M, Laird, N M and Ware, J H (2004) Applied longitudinal analysis, Wiley, New Jersey; Hedeker, D and Gibbons, R D (2006) Longitudinal data analysis, Wiley, New Jersey. All three books have very helpful websites which include data, software and presentations: http://gseacademic.harvard.edu/alda/; http://biosun1.harvard.edu/~fitzmaur/ala/; and http://tigger.uic.edu/~hedeker/long.html
28 Verbeke, G and Molenberghs, G (2000) Linear Mixed Models for Longitudinal Data, Springer, New York; Molenberghs, G and Verbeke, G (2006) Models for discrete longitudinal data, Springer, Berlin; Fitzmaurice, G, Davidian, M, Verbeke, G and Molenberghs, G (eds.) (2008) Longitudinal Data Analysis, Chapman & Hall/CRC, Boca Raton.
29 Singer, J D (1998) Using SAS PROC MIXED to fit multilevel models, hierarchical models, and individual growth models, Journal of Educational and Behavioural Statistics, 23, 323-355; Maas, C J M and Snijders, T A B (2003) The multilevel approach to repeated measures for complete and incomplete data, Quality & Quantity, 37, 71-89; Cheng, J, Edwards, L J, Maldonado-Molina, M M, Komro, K A and Muller, K E (2010) Real longitudinal data analysis for real people: building a good enough mixed model, Statistics in Medicine, 29, 504-520; Goldstein, H and de Stavola, B (2010) Statistical modelling of repeated measurement data, Longitudinal and Life Course Studies, 1, 170-185.
This wealth of material reflects the fact that the random-effects multilevel model has become
the paradigmatic method for the analysis of repeated measurements.30 We think that the
reasons for the widespread adoption of this approach are six-fold.
1 Repeated measures data are naturally hierarchical and can be extended to more
levels
As Figure 2 shows, we can readily conceptualise a simple repeated measures design, with
repeated measures of aerobic performance as level 1 observations hierarchically
nested within individual children at level 2. Note the imbalance, which reflects missing
values, so that not all children are measured on all occasions. This can be extended to
include further structures, so that children are seen as belonging to schools at level 3, and
again there can be imbalance, with a potentially differing number of children observed in
each school.
Figure 2 Repeated measures as a hierarchical structure (a three-level diagram: occasions 1…5 nested within children 1…15, who are in turn nested within schools)
A typical dataframe for this study is shown in Tables 2 and 3. The first table gives the ‘wide’
form, in which columns represent different occasions: column 1 gives the unique
child identifier, followed by 5 columns of endurance (the distance covered in 12 minutes) on
5 separate occasions. The next five columns give the chronological age of the child in decimal
years at the time of measurement, followed by the sex of the child and
a School identifier. Missing values are shown by *.
30 For an apparently dissenting voice see Allison, P D (2009) Fixed Effects Regression Models, Quantitative Applications in the Social Sciences Series, Sage, Thousand Oaks, California. We will return to this issue at the end of the chapter.
Table 2 An extract of the endurance data in wide form

Unique  End1  End2  End3  End4  End5  Age1  Age2  Age3  Age4  Age5  Sex   School
1       19.1  26.0  26.8  26.0  30.3  7.3   9.2   11.3  13.7  15.8  Boy   1
2       16.0  17.8  15.0  18.4  17.9  7.1   9.4   11.4  13.1  15.4  Girl  1
3       16.9  22.4  *     26.3  30.7  7.8   9.8   11.8  13.9  16.1  Boy   1
4       17.8  18.7  23.7  23.4  26.6  7.7   9.6   11.6  13.7  15.9  Boy   1
5       23.9  25.0  26.8  27.4  29.6  7.4   9.5   11.2  12.7  15.0  Boy   1
The form that is needed for multilevel modelling is the ‘vectorised’ or long form shown in
Table 3. The first column gives the unique child identifier, which will be the level 2 identifier
in the model. This is followed by the occasion identifier, which here is the calendar year of
the planned measurement occasion; clearly the intention was to observe the children every
two years. This will be the level 1 identifier, while School will be the level 3 identifier. This
structure in its two-level version is sometimes known as the person-period dataset, as
compared to the wide form, the person-level dataset. The data must be sorted in this
form for correct estimation in MLwiN, so that all sequential observations are next to each
other. The remaining three columns give the endurance measure, the age of the child when
measured, and their sex. The row with the missing measurement (here child 3 in 1998) has
simply been deleted. As we shall see, two operations have been used in the move from
Table 2 to Table 3: the time-varying observations (such as Endurance and Age) have been
‘vectorised’, while the time-invariant observations (such as Sex) have been ‘replicated’
(a code sketch of this reshape follows Table 3).
Table 3 The data in long format

Child   Year   School   Endur   Age    Sex
1       1994   1        19.1    7.3    Boy
1       1996   1        26.0    9.2    Boy
1       1998   1        26.8    11.3   Boy
1       2000   1        26.0    13.7   Boy
1       2002   1        30.3    15.8   Boy
2       1994   1        16.0    7.1    Girl
2       1996   1        17.8    9.4    Girl
2       1998   1        15.0    11.4   Girl
2       2000   1        18.4    13.1   Girl
2       2002   1        17.9    15.4   Girl
3       1994   1        16.9    7.8    Boy
3       1996   1        22.4    9.8    Boy
3       1998   1        *       11.8   Boy
3       2000   1        26.3    13.9   Boy
3       2002   1        30.7    16.1   Boy
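The move from Table 2 to Table 3 can equally be done outside MLwiN before importing the worksheet; here is a pandas sketch of the vectorise-and-replicate step, using hypothetical column names that mirror Table 2:

# A sketch of the move from wide (Table 2) to long (Table 3):
# time-varying columns are 'vectorised', time-invariant ones 'replicated'.
import pandas as pd
wide = pd.DataFrame({
    "Unique": [1, 2], "Sex": ["Boy", "Girl"], "School": [1, 1],
    "End1": [19.1, 16.0], "End2": [26.0, 17.8],
    "Age1": [7.3, 7.1],   "Age2": [9.2, 9.4],
})
long = pd.wide_to_long(wide, stubnames=["End", "Age"],
                       i="Unique", j="Occasion").reset_index()
long = long.sort_values(["Unique", "Occasion"])   # MLwiN needs sorted data
print(long)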
2 The fixed part of the model estimates the general trend over time
The fixed part of the model generally consists of polynomials of age (or some metric of time)
and interactions, which allow us to model trends. To take a simple example, the quadratic
equation of Age:

y_i = β0 + β1 Age_i + β2 Age_i²        (1)
can take on the four different forms of Figure 3, depending on the values of the
linear and quadratic parameters as given in Table 4. The rule is that the number of bends or
inflection points is one less than the order of the polynomial, so that the linear equation
has no bends and the quadratic has 1. More complex shapes can be captured by a cubic
polynomial, which allows two bends. More flexible shapes can be achieved by replacing
polynomials with splines.31
Figure 3 Four alternative forms from a quadratic relationship (Hedeker and Gibbons, 2006)
31 Pan, H. and Goldstein, H. (1998) Multi-level repeated measures growth modelling using extended spline functions, Statistics in Medicine, 17, 2755-2770.
Table 4 The parameter values for the four alternative forms of Figure 3

Graph   Form                          Intercept (β0)   Linear slope (β1)   Quadratic slope (β2)
a)      Decelerating positive slope   2                2                   -0.25
b)      Accelerating positive slope   2                2                   +0.25
c)      Positive to negative slope    2                8                   -1.2
d)      Inverted U shape              2                11                  -2.2
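These four forms are easy to reproduce; a small Python sketch evaluating equation (1) with the Table 4 parameter values (illustrative only):

# A sketch: evaluate the quadratic trend of equation (1),
# y = b0 + b1*Age + b2*Age^2, for the four parameter sets of Table 4.
import numpy as np
age = np.linspace(0, 10, 101)
forms = {                      # (b0, b1, b2) from Table 4
    "decelerating positive": (2, 2, -0.25),
    "accelerating positive": (2, 2, +0.25),
    "positive to negative":  (2, 8, -1.2),
    "inverted U":            (2, 11, -2.2),
}
curves = {name: b0 + b1 * age + b2 * age**2
          for name, (b0, b1, b2) in forms.items()}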
3 The fixed part accommodates time-varying and time-independent predictors
As we saw in Table 3, there may be time-varying predictors, such as Age, and time-invariant
predictors, such as Sex. From the multilevel perspective these are variables that are
measured at the lower (occasion) level and the higher (child) level respectively. It is
common practice for these variables to be included in the fixed part of the model, both as
main effects and as interactions. If Sex (measured for each child at level 2, represented by
the subscript j) and Age (measured for each child on each occasion at level 1, represented
by the subscript ij) are included as interactions, they form a cross-level interaction. A full
two-level quadratic interaction is specified as follows:

y_ij = β0 + β1 Age_ij + β2 Age_ij² + β3 Boy_j + β4 Boy_j*Age_ij + β5 Boy_j*Age_ij²        (2)
Figure 4 shows some characteristic plots when the plot for Girls (the ‘reference’ category) is
the ‘decelerating positive slope’ one of Figure 3a. The values that were used to generate the
data are given in Table 5. Again it is clear that this straightforward specification is quite
flexible in the different types of trends it is able to handle.
Figure 4 Four alternative forms for a differential Age by Gender interaction
Table 5 The parameter values for the four alternative forms of Figure 4

Graph   Form of Gender gap                  β0 (Girls      β1 (Girls   β2 (Girls     β3 (Boys diff.   β4 (Boys diff.   β5 (Boys diff.
                                            intercept)     linear)     quadratic)    intercept)       linear)          quadratic)
a)      Constant differential               2              2           -0.25         4                0                0
b)      Linearly increasing differential    2              2           -0.25         4                1                0
c)      Quadratic increasing differential   2              2           -0.25         4                0.5              1
d)      Increasing then narrowing gap       2              2           -0.25         4                5                -0.7
4 The random effects approach allows complex heterogeneity in growth
trajectories
The fixed part gives the general trends across all children; the random part allows for
between-child and within-child variations around this general line. Beginning with the
higher (child) level and returning to the overall general trend of equation (1), we can add a
random child intercept (u0j) and a random slope departure (u1j) associated with the Age of
the child, and assume that the random slopes and intercepts come from a joint Normal
distribution:
y_ij = β0 + β1 Age_ij + β2 Age_ij² + (u0j + u1j Age_ij) + e0ij        (3)

[u0j, u1j]' ~ N(0, Ωu),   Ωu = [σ²u0  σu0u1; σu0u1  σ²u1]
This specification in effect allows each child to have their own distinctive growth
trajectory, but we are not fitting a separate line for each child (as we would in a fixed-effects
specification with a dummy and an age interaction for each and every child); rather, each
child’s trajectory is seen as an allowed-to-vary departure from a general line.
Figure 5 shows a number of different characteristic patterns that can be achieved
with different values of the level-2 variance-covariance matrix when the underlying trend is
the decelerating positive slope of Figure 3a. Each of the graphs on the far left shows the
trajectories for six children around the general trend, with Age on the horizontal axis as time
since baseline, which is given a value of zero.
Figure 5 Varying relations for growth models
Table 6 Interpreting the form and parameters of Figure 5

Graph   Interpretation                                              Intercept variance   Slope variance   Covariance
                                                                    (σ²u0)               (σ²u1)           (σu0u1)
a)      Differences at baseline maintained as children grow older   Yes                  No               -
b)      Small differences at baseline become accentuated as         Yes                  Yes              Positive
        children grow older
c)      Large differences at baseline attenuate as children         Yes                  Yes              Negative
        grow older
d)      Differences at baseline unrelated to subsequent             Yes                  Yes              Zero
        development
Taking each graph in turn, we get the interpretations and associated values for the variance–covariance
terms as shown in Table 6. (These graphs are of course simple variants on the
varying relations plots we have used extensively in these volumes.) Notice that the
covariance summarises the association between status when time is zero (typically
baseline or average age) and the rate of change.
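The patterns of Figure 5 can be mimicked by drawing random intercepts and slopes from the joint Normal distribution of equation (3); here is a simulation sketch with illustrative values (not estimates from the endurance data):

# A sketch: simulate child growth trajectories around a general quadratic
# trend by drawing (u0j, u1j) from a joint Normal distribution, as in (3).
import numpy as np
rng = np.random.default_rng(1)
b0, b1, b2 = 2.0, 2.0, -0.25                 # fixed part: decelerating positive slope
var_u0, var_u1, cov_u01 = 1.0, 0.04, 0.15    # illustrative level-2 (co)variances
omega_u = np.array([[var_u0, cov_u01],
                    [cov_u01, var_u1]])
u = rng.multivariate_normal([0.0, 0.0], omega_u, size=6)   # six children
age = np.linspace(0, 8, 41)                  # time since baseline
trajectories = [(b0 + u0) + (b1 + u1) * age + b2 * age**2 for u0, u1 in u]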
Turning now to the within-child, between-occasion variation, we can add a random
intercept (e0ij) and a random departure (e1ij) associated with the Age of the child, and assume
that these departures come from a joint Normal distribution:
y_ij = β0 + β1 Age_ij + β2 Age_ij² + (u0j + u1j Age_ij) + (e0ij + e1ij Age_ij)        (4)

[u0j, u1j]' ~ N(0, Ωu),   Ωu = [σ²u0  σu0u1; σu0u1  σ²u1]
[e0ij, e1ij]' ~ N(0, Ωe),   Ωe = [σ²e0  σe0e1; σe0e1  σ²e1]
This specification allows differential occasion departures around the child-specific line.
Figure 6 shows a number of different characteristic patterns that can be achieved by
different values of the level-1 variance-covariance matrix when the underlying child
relationship is again the positive decelerating one. Table 7 gives the interpretation and
nature of the associated parameters.
Figure 6 Three plots differentiated by level-1 characteristics

Table 7 Interpreting the form and parameters of Figure 6

Graph   Interpretation                                       Intercept variance   Slope variance   Covariance
                                                             (σ²e0)               (σ²e1)           (σe0e1)
a)      Homoscedasticity: constant variability between       Yes                  0                -
        occasions around general trend
b)      Heteroscedasticity: increasing variability between   Yes                  Yes              Positive
        occasions around general trend
c)      Heteroscedasticity: decreasing variability between   Yes                  Yes              Negative
        occasions around general trend
Another way to characterise the random part is to plot the variance function at each level. A
quite complex example is shown in Figure 7, which examines the trend in life satisfaction
over time in Germany, with a differential trend in the East and West. The original data
structure was 15 occasions for 16k individuals in 16 Länder over the period 1991-2006, with
Life satisfaction measured on a 10-point score.
Figure 7 Changing life satisfaction in East and West Germany 1991-2006
The graphs show:
a) National trend: decline throughout the period in the West; initial improvement in the East, then decline; the East is always less satisfied with life than the West.
b) Between-Länder variance: very small differences between Länder around the national trend.
c) Within-Länder, between-individual variance: over the period there is at first greater equality in between-individual Satisfaction, but then the heterogeneity between individuals grows in both East and West; this must mean that not all have shared in the decline in Satisfaction of the general trends, while for others it is more marked. The West is consistently slightly more unequal.
d) Within individuals over time: in the East there is a decline in volatility to below the levels of the West; individual Life Satisfactions have become more consistent over time; individuals have become more ‘stuck in their groove’.
Figure 8 shows the model specification in MLwiN that was used to fit the model.
Figure 8 The model used to estimate the changes in German Life Satisfaction
It is clear that the fixed part is a cubic polynomial of time and East/West interactions, so that
both parts of the country can have their own general trend with 2 possible bends. Each of
the variance functions consists of a variance term for when Time is zero (eg the between-
Länder variation), a variance for linear Time, and a differential variance for the
East, together with the associated covariances, so that a quadratic function of linear Time is
allowed in both East and West at all three levels.
5 The random part models the dependency in the response over time
The distinctive feature of repeated measures data is that the response on one occasion is
likely to be related to the response on another occasion; that is, the response is auto- or self-
correlated, or dependent. Thus, for example, the height of a child on one occasion is likely to be
highly related to height on a previous occasion. Similarly, we could expect the income of
individuals to be moderately dependent, while voting choice, with quite a bit of churning
over time, is less dependent. Substantively we want to know this degree of autocorrelation to
see the extent and nature of the volatility in the outcome. Technically, dependent data do
not provide as much information as you might think. Thus, if there are 300 children and 5
measurement occasions there are not 300 × 5 independent pieces of information, but
somewhat less. It is well known that a model that does not take account of dependency
will result in the overestimation of the standard errors of time-varying variables (such as
Age) and underestimation for time-invariant variables (such as Sex).
Table 8 The correlations for Endurance on different occasions

|    | Y2   | Y3   | Y4   | Y5   |
|----|------|------|------|------|
| Y1 | 0.65 | 0.69 | 0.69 | 0.69 |
| Y2 |      | 0.72 | 0.71 | 0.70 |
| Y3 |      |      | 0.76 | 0.77 |
| Y4 |      |      |      | 0.81 |
In order to get some feel for this dependency consider again the data in the wide form of
Table 1. If we correlate the responses on each of the 5 occasions (Y1 to Y5) we get the
correlations in Endurance for children on each and every pair of occasions as shown in Table
8. Clearly there is quite a strong correlation of greater than 0.65 which must be taken into
account in the modelling process. To get some feel for the reduction in the degrees of
freedom we can use the equation for the effective sample size ($n_{eff}$):32

$$n_{eff} = \frac{\text{total number of observations}}{1 + \rho(m-1)} \qquad (5)$$

where ρ is the degree of dependency over time (say, 0.65) and m is the size of the cluster
(here typically 5 occasions are nested within each individual), so that the equation
becomes

$$n_{eff} = \frac{1500}{1 + 0.65 \times (5-1)} = 417 \qquad (6)$$

We do not have 1500 independent observations but substantially less, under a third of that.
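As a hedged aside (Python, not an MLwiN facility), equation (5) is a one-liner; the function name is our own invention:

def effective_sample_size(n_total, rho, m):
    # Equation (5): discount the total n for within-cluster dependency
    return n_total / (1 + rho * (m - 1))

print(effective_sample_size(1500, 0.65, 5))   # about 417, as in equation (6)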
This degree of correlation can change as we model or condition on the fixed part of
the model, so it is vital to model explicitly the degree of residual autocorrelation or
dependency. As we shall see in more detail later, we can have a number of different patterns
of correlation. The simplest possible form of dependency is the so-called compound
symmetry approach (this turns out to be the standard random-intercepts model) in which
the degree of correlation is unchanging with occasion. This means that the correlation is
assumed to be the same between occasions 1 and 2 (a lag of 1), between occasions 1 and 3 (a
lag of 2), between 2 and 3 (another lag of 1), and indeed between each and every pair of
occasions. The most complex possible form of dependency is called unstructured, and this
permits a different correlation at each and every lag. Both of these types of dependency
structure can be modelled in MLwiN, as well as variants that lie between these two extremes.
To get a feeling for this dependency, consider again the three components of the
subject-specific model shown in Figure 9. In comparison to Figure 1, the population average
trend and the subject-specific trends are the same, but the within-person, between-occasion
values show much more structure than the ‘white noise’ of the earlier figure. That is, there is
a tendency for a positive happiness score at a particular time to be followed by another positive
score; similarly, a negative score below the subject-specific line tends to be followed by
negative scores.

32 Cochran, W.G. (1977) Sampling Techniques (Third edition), Wiley, New York.
Figure 9 Three components of the subject-specific with serial dependency between
occasions
Figure 10 Auto-correlogram for Figure 9 and Figure 1
Thus there is clear, marked positive autocorrelation for each person, in that points
close together in time tend to be differentially high or low. This type of random variation is
usually a decreasing function of the time separation between measurements (the lag) and
this can be highlighted in an autocorrelation plot where dependency is plotted against lag.
Figure 10a shows a characteristic plot of marked serial correlation with its defining
decay of dependency with lag, while 10b shows the auto-correlogram for independent white noise.
Here, because of the lengthy time sequence, we are able to estimate the degree of
correlation over a number of lags. Such dependency is of substantive interest and has to be
modelled properly to get correct standard errors for the fixed part coefficients.
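To see what such an auto-correlogram summarises, here is a minimal Python sketch (illustrative only, using simulated series rather than the chapter's data) that computes the sample autocorrelation at successive lags:

import numpy as np

def acf(x, max_lag):
    # Sample autocorrelation at lags 1..max_lag for a single de-meaned series
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = np.sum(x * x)
    return [np.sum(x[lag:] * x[:-lag]) / denom for lag in range(1, max_lag + 1)]

rng = np.random.default_rng(1)
white = rng.normal(size=200)                                      # independent 'white noise'
serial = np.convolve(white, [0.8 ** k for k in range(10)])[:200]  # serially dependent series
print(acf(serial, 5))   # decays with lag, as in Figure 10a
print(acf(white, 5))    # hovers around zero, as in Figure 10b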
6 The random-effects approach allows unbiased estimation even with missing
data
The random-effects model has the valuable property that missing values on particular
occasions can simply be dropped without biasing the estimates (Goldstein and de Stavola,
2010).33 This allows the efficient use of all the information that we have; we do not need to
omit children who lack a complete set of observations. This property is
based on the assumption of Missing at Random (Little and Rubin, 2002),34 so that the
missingness itself is un-informative and the non-response process depends only on observed
but not on unobserved variables; in particular, the probability of being missing does not
depend on responses we would have observed had they not been missing (eg in a study of
alcohol consumption, the heavy drinkers not showing up for appointments). The assumption
is a reasonably forgiving one and applies not to the response but to the response
conditional on the fixed part, that is the residuals. Thus, if the dropout is solely due to
older girls failing to be measured, and terms for gender and age interactions are included
in the fixed part of the model, the estimates can be expected to be unbiased. Moreover,
with only a few missing values and no severe imbalance, any imparted bias is likely to be
small (Maas and Snijders, 2003).
While we cannot generally assess the assumption conclusively, as we lack the very
data we need, it is possible to use a pattern-mixture approach to give some insights into
what is going on. The procedure works by identifying a small number of patterns of
missingness (eg those missing on occasions 2 and 3; those missing on 4 and 5) and then
using dummy variables to represent the different patterns of missingness in a stratified analysis.
The use of this approach in multilevel analysis is explained by Hedeker and Gibbons (1997)
and applied by Hedeker and Rose (2000).35 If the data are not MAR we have three main
options: pattern-mixture models, multiple imputation,36 and selection models in which you
33 The time trend for subjects with missing observations is estimated by borrowing strength from subjects with similar characteristics.
34 Little, R.J.A. and Rubin, D.B. (2002) Statistical Analysis with Missing Data, 2nd ed. New York: John Wiley.
35 Hedeker, D. and Gibbons, R.D. (1997) Application of random-effects pattern-mixture models for missing data in longitudinal studies. Psychological Methods, 2, 64-78; Hedeker, D. and Rose, J.S. (2000) The natural history of smoking: a pattern-mixture random-effects regression model. In: Rose, J.S., Chassin, L., Presson, C.C. and Sherman, S.J. (Eds.), Multivariate Applications in Substance Use Research, pp. 79-112. Lawrence Erlbaum Associates, Mahwah, NJ.
36 Goldstein, H. (2009) Handling attrition and non-response in longitudinal data, Longitudinal and Life Course Studies, 1, 63-72.
fit a model for the complete data and another model for the selection process that gives rise
to the missingness.37
Algebraic specification of random effects model for repeated measures
This section provides a detailed step-by-step specification of the Normal-theory two and
three-level model.
Two-level model: random intercepts
We start with the micro-model for occasions within children
$$y_{ij} = \beta_{0j} x_{0ij} + \beta_1 x_{1ij} + e_{0ij} x_{0ij} \qquad (7)$$
where the observed variables are:
- $y_{ij}$: the response, the distance covered in hundreds of metres on occasion i by child j;
- $x_{1ij}$: the age of the child on each occasion; that is, age is a time-varying variable. We chose to centre this value around its grand mean across all occasions and all children to produce a more interpretable intercept and to aid model convergence (Snijders and Bosker, 1999, 80);38
- $x_{0ij}$: the Constant, a set of 1’s for each and every child on all occasions.
The parameters to be estimated are:
- $\beta_1$: the general slope across all children; this gives the change in distance as children become one year older, a general measure of development;
- $\beta_{0j}$: the intercept for each child; because we have centred age around its grand mean, this gives an estimate of how much each child would have run at that age.
This specification of the model essentially fits a regression line for each and every child in
such a way that they have the same slope (the linear increase with age) but a different
intercept, the amount they would achieve at the average age of the sample. This is a
37 See Rabe-Hesketh, S. (2002) Multilevel selection models using gllamm, at http://ideas.repec.org/p/boc/dsug02/1.html
38 Snijders, T. and Bosker, R. (1999) Multilevel Analysis: An Introduction to Basic and Advanced Multilevel Modeling. Sage Publications, London.
‘parallel lines’ approach in which children differ but this difference does not change with age. The
equation for the estimated lines is

$$\hat{y}_{ij} = \hat{\beta}_{0j} x_{0ij} + \hat{\beta}_1 x_{1ij} \qquad (8)$$
Consequently, the unexplained random part of the micro-model is:
- $e_{0ij}$: the differences from the child-specific line on each occasion; it can be seen in equation (7) that this unobserved latent variable is associated with the Constant.
The associated macro model is the between-children model and can be specified as follows:
$$\beta_{0j} = \beta_0 + \beta_2 x_{2j} + u_{0j} \qquad (9)$$
The observed variable is:
- $x_{2j}$: a time-invariant variable for each child, here a dummy indicator variable with a 1 for a girl and 0 for a boy.
The additional parameters to be estimated are:
- $\beta_0$: the grand-mean intercept which, because of the indicator dummy, gives the mean distance achieved by the reference category of a boy of average age;
- $\beta_2$: the differential achieved by a girl on average.
The unexplained random part of the model is:
- $u_{0j}$: the differential intercept for each child which remains after taking into account the gender of the child.
Substituting the macro equation (9) into the micro equation (7) we obtain the combined
model:

$$y_{ij} = \beta_0 x_{0ij} + \beta_1 x_{1ij} + \beta_2 x_{2j} x_{0ij} + (u_{0j} x_{0ij} + e_{0ij} x_{0ij}) \qquad (10)$$
The β’s are the fixed part of the model and give the averages. Here, $\beta_0$ is the mean distance
achieved by a boy of average age, $\beta_1$ is the linear change in distance if the child (of either
sex) is one year older, and $\beta_2$ is the girl-boy differential at all ages. The random part is
composed of two parts: $u_{0j}$, which is the unexplained differential achievement for a child
given their age and gender; and $e_{0ij}$, which is the unexplained occasion-specific differential given
the child’s age, gender and differential performance. Distributional assumptions complete
the model. Here we assume, given the continuous measurement of performance, that both
the level-1 and level-2 random terms follow a Normal distribution and that there is no
covariance between occasion and child differentials:
u  ~ N ,0(
0j
2
u0
);
e  ~ N ,0(
0ij
2
e0
);
Cov[u0 j , e0ij ]  0
(11)
The variance at level 2, $\sigma^2_{u0}$, summarizes the differences between children conditional on the
terms in the fixed part of the model. If we include additional terms at either the occasion or
child level (that is, time-varying or time-invariant) and they are good predictors of the
response, we would anticipate that this value would become smaller. The variance at level
1, $\sigma^2_{e0}$, summarizes the between-occasion differences. This can only reduce if we include
time-varying variables that are good predictors of the distance that has been run.
The combined model of equation (10) is known as a random-intercepts model as the
allowed-to-vary intercept is treated as a latent or random variable and the child differential,
$u_{0j}$, is assumed to come from a common distribution with the same variance. Thus, in the
random-effects specification the target of inference is not each child but the between-child
variance. If $u_{0j}$ is positive the total residual ($u_{0j} + e_{0ij}$) will tend to be positive, leading to
greater endurance than predicted by the covariates; if the random intercept is negative, the
total residual will also tend to be negative. Since $u_{0j}$ is shared by all responses for the same
child, this induces within-child dependence among the total residuals. The larger the level-2
variance relative to the total variance (this is what ρ measures), the greater this similarity
effect and the dependence will be.
Two-level model: varying slopes
The structure of the basic model can be extended in a number of other ways. An important
development is the random-slopes model in which the slope in the micro-model is
additionally indexed so as to allow each child to have their own development trajectory:

$$y_{ij} = \beta_{0j} x_{0ij} + \beta_{1j} x_{1ij} + e_{0ij} x_{0ij} \qquad (12)$$
There are now two macro-models, one for the allowed-to-vary intercepts:

$$\beta_{0j} = \beta_0 + \beta_2 x_{2j} + u_{0j} \qquad (13)$$

and one for the allowed-to-vary slopes:

$$\beta_{1j} = \beta_1 + \beta_3 x_{2j} + u_{1j} \qquad (14)$$

When both macro-models are substituted into the micro-model to form the combined
model

$$y_{ij} = \beta_0 x_{0ij} + \beta_1 x_{1ij} + \beta_2 x_{2j} x_{0ij} + \beta_3 x_{2j} x_{1ij} + (u_{0j} x_{0ij} + u_{1j} x_{1ij} + e_{0ij} x_{0ij}) \qquad (15)$$
a cross-level interaction is required in the fixed part between time-varying age and time-
invariant gender. As before the β’s give the averages: $\beta_0$ is the mean distance achieved by
an average-aged boy, $\beta_1$ is the linear change in distance if boys are one year older, $\beta_2$ is the
girl-boy differential at the average age, and $\beta_3$ is the differential slope for girls in
comparison to boys.
The random part is now composed of three parts: $u_{0j}$, which is the unexplained
differential achievement at average age for a child; $u_{1j}$, which is the differential child-specific slope;
and $e_{0ij}$, which is the unexplained occasion-specific differential given the child’s age, gender and
differential performance. The distributional assumptions are
$$\begin{bmatrix} u_{0j} \\ u_{1j} \end{bmatrix} \sim N\left(0, \begin{bmatrix} \sigma^2_{u0} & \\ \sigma_{u0u1} & \sigma^2_{u1} \end{bmatrix}\right); \qquad e_{0ij} \sim N(0, \sigma^2_{e0}) \qquad (16)$$
So that the level-2 random intercepts and slopes are assumed to come from a joint Normal
distribution with $\sigma^2_{u0}$ being the variance of the intercepts, $\sigma^2_{u1}$ being the variance of the
slopes, and $\sigma_{u0u1}$ being the covariance of the intercepts and slopes. The total variance
between children at level 2 is then given by a quadratic variance function (Bullen et al,
1997)39

$$\mathrm{Var}(u_{0j}x_{0ij} + u_{1j}x_{1ij}) = \sigma^2_{u0}x_{0ij}^2 + 2\sigma_{u0u1}x_{0ij}x_{1ij} + \sigma^2_{u1}x_{1ij}^2 \qquad (17)$$
This specification can accommodate variation between children increasing over time,
decreasing and remaining steady as we have seen in Figure 5.
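As a small illustration of equation (17) (a Python sketch with invented parameter values, not estimates; the Constant $x_0$ is 1), the level-2 variance can be evaluated across centred ages to see whether between-child variation fans out or funnels in:

import numpy as np

def level2_variance(age, var_u0, var_u1, cov_u0u1):
    # Equation (17) with x0 = 1 and x1 = centred age
    return var_u0 + 2 * cov_u0u1 * age + var_u1 * age ** 2

ages = np.linspace(-4, 4, 9)                   # centred ages
print(level2_variance(ages, 8.0, 0.5, 1.0))    # positive covariance: fanning out with age
print(level2_variance(ages, 8.0, 0.5, -1.0))   # negative covariance: converging, then diverging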
Non-linear growth curves
In the above specification the underlying growth curve has been specified to have a linear
form. The simplest way to specify a non-linear curve and thereby allow development to be
non-constant is to include a polynomial of age in the micro model, such as the square of age
$$y_{ij} = \beta_{0j} x_{0ij} + \beta_1 x_{1ij} + \beta_2 x_{1ij}^2 + e_{0ij} x_{0ij} \qquad (18)$$

It is also possible to allow the parameters associated with age to be random at level 2:

$$y_{ij} = \beta_{0j} x_{0ij} + \beta_{1j} x_{1ij} + \beta_{2j} x_{1ij}^2 + e_{0ij} x_{0ij} \qquad (19)$$
so that this model allows ‘curvilinearity’ at both the population and individual levels. In this
model:
- $\beta_{0j}$ is the performance for person j at the centred age;
- $\beta_{1j}$ is the linear change with age for person j; this is known as the instantaneous rate of change;
- $\beta_{2j}$ is the term associated with the quadratic of age for person j; it is sometimes known as the curvature. A positive estimate for this term would indicate acceleration of growth (and a convex curve) while a negative value would indicate deceleration (and a concave curve).

39 Bullen, N., Jones, K. and Duncan, C. (1997) Modelling complexity: analysing between-individual and between-place variation - a multilevel tutorial, Environment and Planning A, 29(4), 585-609.
The choice between equations (18) and (19) is often an empirical one, depending on what
degree of complexity is supported by the data. In general it is perfectly acceptable to have
quite complex fixed effects while only allowing the lower-order terms, such as the linear one,
to vary randomly between subjects; indeed a random linear term for age already specifies a quadratic
variance function, as we saw in equation (17).
It is a simple matter in both quadratic models to calculate the point at which the
trend flattens out, the stationary or turning point, through the derivative, and this can be
done across the complete growth trajectory:

$$\frac{dy}{d\,Age} = \beta_1 + 2\beta_2 Age \qquad (20)$$

which equals zero for

$$Age = \frac{-\beta_1}{2\beta_2} \qquad (21)$$
This may of course be outside the range of the data.
Other forms of growth can be fitted so that a more ‘curvy’ result is possible by
adding a cubic term (so that there are two bends) while more complex models
dealing with growth spurts or growth falling off as maturity is reached may require
something more complex than polynomials. Splines are a particularly flexible choice. A key
paper is Verbyla et al (1999), who show that a mixed-model (ie random coefficient)
methodology can be used to derive an optimal smoothing parameter for the
curviness of the splines from a single additional variance component.40 Here, given
the continuous growth that can be expected for physical measures of performance, it is
unlikely that we will need more complex forms such as piecewise relations, which can
accommodate a sudden jump in development (Snijders and Bosker, 1999, 187; Holt, 2008).41
There are also intrinsically non-linear models that cannot be transformed to linearity.42
40 Verbyla, A.P., Cullis, B.R., Kenward, M.G. and Welham, S.J. (1999) The analysis of designed experiments and longitudinal data using smoothing splines. Applied Statistics, 48, 269-312.
41 Holt, J.K. (2008) Modeling growth using multilevel and alternative approaches. In A.A. O'Connell and D.B. McCoach (Eds.) Multilevel Analysis of Educational Data, IAP, Charlotte, 111-159.
42 Davidian, M. and Giltinan, D.M. (1995) Nonlinear Models for Repeated Measurement Data. New York, Chapman Hall; and also chapter 9 of Goldstein (2011).
Dummy variables can also be added to the equation to mark particular time-based
processes so that, in a detailed study of happiness on an hourly basis, a dummy could signify
the weekend, and another dummy Monday morning! Another example is to include a
dummy for the first measurement occasion in addition to the time trend so as to model a
‘learning effect’ from having completed the task once.
Modelling dependence in the random intercepts and slopes model
The random-intercepts specification of equation (10) implies a degree of correlation
between the children’s measures on different occasions. Based on the assumptions of
equation (11), the covariance between child j on occasions 1 and 2 (Goldstein, 2011, 19) is

$$\mathrm{Cov}(u_{0j} + e_{01j},\; u_{0j} + e_{02j}) = \sigma^2_{u0} \qquad (22)$$
Consequently, the degree of correlation between occasions is given by

$$\rho = \frac{\sigma^2_{u0}}{\sigma^2_{u0} + \sigma^2_{e0}} \qquad (23)$$

which is the usual ratio of the between-child variance to the total variance. This type of
dependency is known as compound symmetry and constrains the correlation to be the same
between any pair of occasions. This model therefore imposes a block-diagonal structure on
the data so that observations within the same child on different occasions have the same
correlation, ρ, whatever the lag apart, while observations of different children have a
correlation of 0, that is they are independent.
Figure 11 Block diagonal correlation structure - two level random intercepts model
The random-slopes model relaxes this by allowing the degree of similarity to vary with age
(Snijders and Bosker, 1999, 172). It is also possible to model explicitly the between-occasion
dependency through covariance terms, and a number of different formulations are available
which we will discuss later.
Three-level model
Differential school effects can be handled by a three-level model in which occasions are
nested in individuals who are in turn nested in schools. The combined model in its random-
intercepts form (building on equation 10) is

$$y_{ijk} = \beta_0 x_{0ijk} + \beta_1 x_{1ijk} + \beta_2 x_{2jk} + (v_{0k} + u_{0jk} + e_{0ijk}) \qquad (24)$$

where the dependent variable $y_{ijk}$ is the distance covered on occasion i by child j of school k.
The random part has three elements: $v_{0k}$, which is the unexplained differential
achievement for a school around the linear trend; $u_{0jk}$, which is the unexplained
differential achievement for a child given their age, gender and school; and $e_{0ijk}$, which is the
unexplained occasion-specific differential given the child’s age, gender and differential
performance. The distributional assumptions complete the model:
$$v_{0k} \sim N(0, \sigma^2_{v0}); \qquad u_{0jk} \sim N(0, \sigma^2_{u0}); \qquad e_{0ijk} \sim N(0, \sigma^2_{e0}) \qquad (25)$$
The terms give the residual variance between schools, between children and between
occasions.
Figure 12 Block diagonal correlation structure - three level random-intercepts model
Figure 12 shows the correlation structure of a three level random intercepts model. There
are now two measures of dependency:
the intra-school correlation (within the same school, different children):

$$\rho_2 = \frac{\sigma^2_{v0}}{\sigma^2_{v0} + \sigma^2_{u0} + \sigma^2_{e0}} \qquad (26)$$

and the intra-child correlation over occasions (within the same school and the same child):

$$\rho_1 = \frac{\sigma^2_{v0} + \sigma^2_{u0}}{\sigma^2_{v0} + \sigma^2_{u0} + \sigma^2_{e0}} \qquad (27)$$
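As an illustrative aside (Python, not part of the MLwiN workflow), these two correlations are easily computed from the three variance estimates; the variance values below are invented, not estimated:

def three_level_correlations(var_v0, var_u0, var_e0):
    # Equations (26) and (27)
    total = var_v0 + var_u0 + var_e0
    rho2 = var_v0 / total                # same school, different children
    rho1 = (var_v0 + var_u0) / total     # same school, same child, different occasions
    return rho2, rho1

print(three_level_correlations(2.0, 10.0, 9.0))   # illustrative values only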
Some covariance structures for repeated measures data
The multilevel model automatically corrects the standard errors of the fixed part of the
model for the degree of dependency in the repeated measures and thereby protects against
incorrectly finding statistically significant results. This is especially the case for time-invariant
variables, which only vary at the higher level. But it can only do this if the nature of the
dependency has been modelled correctly. Consequently, it is worth considering alternative
forms of dependency, for as Cheng et al (2010, 506) argue, ‘even when the covariance
structure is not the primary interest, an appropriate covariance model is essential to obtain
valid inference for the fixed effects’. Modelling the dependency appropriately is likely to
result in smaller standard errors, that is more efficient model-based inferences. It is
possible to estimate a number of different dependency structures with repeated measures
data by specifying different covariance patterns and therefore different correlations over
time. A number of commonly employed formulations are now described.43
Homogenous without dependency
The simplest possible covariance pattern, here shown for four occasions, is:

$$\begin{array}{l} \text{Time 1} \\ \text{Time 2} \\ \text{Time 3} \\ \text{Time 4} \end{array}
\begin{bmatrix}
\sigma^2 & & & \\
0 & \sigma^2 & & \\
0 & 0 & \sigma^2 & \\
0 & 0 & 0 & \sigma^2
\end{bmatrix}$$
In this formulation:
- there are equal variances along the main diagonal, that is the same variance at each occasion, so the structure is called a homogenous one;
- there are zero covariances in the off-diagonal positions, so there is no dependency over time;
- only one term, the pooled variance ($\sigma^2$), needs to be estimated;
- this is a standard ANOVA model that can be estimated by Ordinary Least Squares;
- because there is no dependency, the specification is not a multilevel one.

43 The same algebraic symbols are used in the different formulations but they can have a different meaning; there is internal coherence within a specification but not necessarily between them.
A heterogeneous form of this simple structure would have a separate variance for each
occasion so that there are four terms to be estimated, but the covariances remain at zero so
there is no dependency.
Compound symmetry (random-intercepts model)
The simplest form that allows dependency is the ‘standard’ multilevel random-intercepts
model (as in equation 10), which is known as the compound symmetry formulation:

$$\begin{array}{l} \text{Time 1} \\ \text{Time 2} \\ \text{Time 3} \\ \text{Time 4} \end{array}
\begin{bmatrix}
\sigma^2_{u0}+\sigma^2_{e0} & & & \\
\sigma^2_{u0} & \sigma^2_{u0}+\sigma^2_{e0} & & \\
\sigma^2_{u0} & \sigma^2_{u0} & \sigma^2_{u0}+\sigma^2_{e0} & \\
\sigma^2_{u0} & \sigma^2_{u0} & \sigma^2_{u0} & \sigma^2_{u0}+\sigma^2_{e0}
\end{bmatrix}$$
This formulation has:
- only two terms in the random part (the between-children variance, $\sigma^2_{u0}$, and the within-children, between-occasions variance, $\sigma^2_{e0}$) irrespective of how many occasions there are;
- the covariance given by $\sigma^2_{u0}$, with the degree of correlation between any two pairs of occasions being $\sigma^2_{u0}/(\sigma^2_{u0}+\sigma^2_{e0})$;
- consequently, the same degree of dependency (equal correlation) between occasions, irrespective of lag, is imposed;
- a form that can readily be estimated as a two-level hierarchical model with occasions nested within children.
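To make the structure concrete, here is a minimal Python/NumPy sketch (an illustration, not an MLwiN feature) that builds the compound-symmetry matrix from the two variance terms; the values echo the null-model estimates reported later in this chapter:

import numpy as np

def cs_covariance(m, var_u0, var_e0):
    # Compound symmetry: var_u0 in every off-diagonal cell, var_u0 + var_e0 on the diagonal
    return np.full((m, m), var_u0) + var_e0 * np.eye(m)

V = cs_covariance(4, 11.699, 9.112)
print(V[0, 1] / V[0, 0])   # implied correlation, about 0.56, the same at every lag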
Un-structured multivariate model
The least parsimonious model has unstructured covariances:

$$\begin{array}{l} \text{Time 1} \\ \text{Time 2} \\ \text{Time 3} \\ \text{Time 4} \end{array}
\begin{bmatrix}
\sigma^2_1 & & & \\
\sigma_{12} & \sigma^2_2 & & \\
\sigma_{13} & \sigma_{23} & \sigma^2_3 & \\
\sigma_{14} & \sigma_{24} & \sigma_{34} & \sigma^2_4
\end{bmatrix}$$
This multivariate repeated measures structure is the most complex, with:
- separate variances on the diagonal and separate covariances on the off-diagonal;
- the variance estimated separately for each occasion (that is, it is a heterogeneous formulation), and separate covariances for each pair of occasions;
- a dependency structure that is allowed to change with occasion, so that the dependency between observations the same lag apart may be higher later in the sequence when the process under study ‘has bedded in’;
- a number of parameters that grows rapidly with the number of occasions; m(m+1)/2 parameters are needed in the model where m is the number of occasions; in the Endurance example, we need to estimate 15 parameters for just 5 occasions, which leads to less precise estimation; if there are 20 waves, 210 parameters will be needed;
- estimation as a multivariate multilevel model with each occasion as a separate response; consequently, in this saturated model, there will be no level-1 within-person, between-occasion variance;
- covariances that can be used to estimate the correlation as usual, as the covariance divided by the product of the square roots of the variances, such that the correlation between Occasions 1 and 2 is given by $\sigma_{12}/(\sigma_1 \sigma_2)$.
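The conversion from an estimated unstructured covariance matrix to the corresponding correlation matrix, as described in the final point above, can be sketched in a few lines of Python (the covariance values are invented for illustration):

import numpy as np

def cov_to_corr(V):
    # Divide each covariance by the product of the two standard deviations
    sd = np.sqrt(np.diag(V))
    return V / np.outer(sd, sd)

V = np.array([[9.0, 5.9, 6.4],
              [5.9, 10.0, 7.3],
              [6.4, 7.3, 11.0]])   # illustrative unstructured covariances
print(cov_to_corr(V))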
Toeplitz model
Finally, we will consider three ‘middle-ground’ models in terms of parsimony. The Toeplitz
model is formulated so that the covariance between measurements is assumed to change
with the length of lag:
$$\begin{array}{l} \text{Time 1} \\ \text{Time 2} \\ \text{Time 3} \\ \text{Time 4} \end{array}
\begin{bmatrix}
\sigma^2_0 & & & \\
\sigma_1 & \sigma^2_0 & & \\
\sigma_2 & \sigma_1 & \sigma^2_0 & \\
\sigma_3 & \sigma_2 & \sigma_1 & \sigma^2_0
\end{bmatrix}$$
It can be seen that:
- the variances on the diagonal are the same at each occasion;
- the covariances on the off-diagonal are arranged so that occasions that are the same lag apart have the same dependency;
- this has the same banded structure as the AR1 model (described next) but is less restrictive, as the off-diagonal elements are not forced to be an identical fraction of the elements of the prior band; they are separately estimated;
- the number of parameters grows linearly with the number of occasions (so not as rapidly as the unstructured model); in the Endurance example, there are only 5 parameters compared to the 15 of the unstructured model; when there is a long series, the correlation at large lags can be set to zero;
- this model can be estimated in MLwiN either as a multivariate multilevel model with linear constraints on the random parameters imposed through the RCON command, or by an appropriate design structure imposed on a ‘standard’ multilevel model using the SETD feature;
- a heterogeneous form with different variances is also possible.
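A Toeplitz matrix of this banded kind can be built directly with SciPy, purely to visualise the structure (the covariance terms below are invented, not estimated):

from scipy.linalg import toeplitz

# First column: the common variance then one covariance per lag (illustrative values)
first_col = [9.0, 5.0, 4.0, 3.5]    # sigma_0^2, sigma_1, sigma_2, sigma_3
print(toeplitz(first_col))          # occasions the same lag apart share a covariance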
First order autoregressive residual model
The first order autoregressive residual model (AR1) structure is so-called to distinguish it
from a true autoregressive model with a lagged response. It has the following covariance
$$\begin{array}{l} \text{Time 1} \\ \text{Time 2} \\ \text{Time 3} \\ \text{Time 4} \end{array}
\begin{bmatrix}
\sigma^2 & & & \\
\alpha\sigma^2 & \sigma^2 & & \\
\alpha^2\sigma^2 & \alpha\sigma^2 & \sigma^2 & \\
\alpha^3\sigma^2 & \alpha^2\sigma^2 & \alpha\sigma^2 & \sigma^2
\end{bmatrix}$$
which is equivalent to the correlation structure:

$$\begin{array}{l} \text{Time 1} \\ \text{Time 2} \\ \text{Time 3} \\ \text{Time 4} \end{array}
\begin{bmatrix}
1 & & & \\
\alpha^{(t_2-t_1)} & 1 & & \\
\alpha^{(t_3-t_1)} & \alpha^{(t_3-t_2)} & 1 & \\
\alpha^{(t_4-t_1)} & \alpha^{(t_4-t_2)} & \alpha^{(t_4-t_3)} & 1
\end{bmatrix}$$
This formulation has:
- equal variances on the main diagonal;
- off-diagonal covariance terms that represent the variance multiplied by the autoregressive coefficient, α, raised to increasing powers as the observations become increasingly separated in time; such increasing powers mean decreasing covariances;
- the assumption that the greater the time span between two measurements, the lower the correlation; clearly this formulation can handle continuous time as well as fixed occasions;
- with fixed occasions, a banded structure (like the Toeplitz) so that there is the same correlation between responses one lag apart (that is, occasions 1 and 2, 2 and 3, and 3 and 4); similarly, the same correlation is imposed between occasions 2 lags apart (occasions 1 and 3, 2 and 4);
- only one additional term, the autoregressive parameter, α, compared to the compound symmetry model;
- in practice it has been found for many data sets that this rather simple model imposes too strong a decay on the dependency with lag;
- facilities in MLwiN for the estimation of homogenous and heterogeneous AR(1) models.
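A short Python sketch of the AR(1) structure, written for arbitrary measurement times so that the continuous-time case mentioned above is also covered (the α and σ² values are illustrative, not estimates):

import numpy as np

def ar1_covariance(times, sigma2, alpha):
    # Covariance is sigma^2 * alpha^|t_i - t_j|, so it decays with time separation
    t = np.asarray(times, dtype=float)
    return sigma2 * alpha ** np.abs(t[:, None] - t[None, :])

print(ar1_covariance([1, 2, 3, 4], sigma2=9.0, alpha=0.7))      # fixed occasions: banded
print(ar1_covariance([0.0, 0.5, 2.0], sigma2=9.0, alpha=0.7))   # unequal spacing handled too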
Autoregressive weights model
The final model (which is not generally used in the literature) can be called an
autoregressive weights model. Given its degree of parsimony, it is a very flexible
specification that can be estimated readily in MLwiN with likelihood procedures but not
with MCMC:
$$\begin{array}{l} \text{Time 1} \\ \text{Time 2} \\ \text{Time 3} \\ \text{Time 4} \end{array}
\begin{bmatrix}
\sigma^2_0 & & & \\
\alpha w_1 & \sigma^2_0 & & \\
\alpha w_2 & \alpha w_1 & \sigma^2_0 & \\
\alpha w_3 & \alpha w_2 & \alpha w_1 & \sigma^2_0
\end{bmatrix}$$
This formulation has:
- the same variances on the diagonal at each occasion;
- covariances on the off-diagonal arranged so that occasions the same lag apart have the same dependency; the dependency depends only on lag with fixed occasions, or on time span with continuous time;
- with fixed occasions, the same banded structure as the Toeplitz model, but it is more parsimonious in that the autoregressive parameter, α, is multiplied by a set of ‘weights’; thus the weights, w, could be an inverse function of lag (1/1, 1/2, 1/3) so that there is less and less dependency the further apart the lags are; a ‘steeper’ decline in the dependency could be achieved by specifying inverse power weights so that the three required weights, representing the three different lags, are 1/1², 1/2² and 1/3²;
- only one additional parameter, α, in comparison to the compound symmetry model;
- unlike the AR1 model, no rigid structure where the off-diagonal elements are forced to be an identical fraction of the elements of the prior band;
- estimation in MLwiN by choosing an appropriate design structure on a ‘standard’ multilevel model using the SETD feature; you have to choose the weights, they are not part of the estimation process.
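Again as an illustration only (the weights are chosen by the analyst, exactly as the last point says; the numbers are invented), a Python sketch of this banded weights structure:

import numpy as np

def ar_weights_covariance(m, var0, alpha, weights):
    # Lag-k covariance is alpha * weights[k-1]; the weights are fixed, only alpha is estimated
    V = var0 * np.eye(m)
    for k in range(1, m):
        V += alpha * weights[k - 1] * (np.eye(m, k=k) + np.eye(m, k=-k))
    return V

print(ar_weights_covariance(4, var0=9.0, alpha=4.0, weights=[1, 1/2, 1/3]))  # inverse-lag weights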
Block-diagonal structures
To help avoid confusion, it must be remembered that we have been showing the variancecovariance matrix for a single child. In reality this matrix is imposed as a symmetric block
diagonal structure; with a block for each and every child. Here is an extract from the full
matrix for 3 children measured on 4 occasions where C3t 4 is the fourth occasion for the
third child when an unstructured matrix is imposed.
$$\Omega = \begin{bmatrix} V & 0 & 0 \\ 0 & V & 0 \\ 0 & 0 & V \end{bmatrix}, \qquad
V = \begin{bmatrix}
\sigma^2_1 & & & \\
\sigma_{12} & \sigma^2_2 & & \\
\sigma_{13} & \sigma_{23} & \sigma^2_3 & \\
\sigma_{14} & \sigma_{24} & \sigma_{34} & \sigma^2_4
\end{bmatrix}$$

with the rows and columns of the full 12-by-12 matrix ordered C1t1, ..., C1t4, C2t1, ..., C2t4, C3t1, ..., C3t4: each child has the same unstructured block V on the diagonal, and all the cells linking occasions from different children are zero.
We will subsequently fit a number of these models in MLwiN, but will only do so after fitting
a random-intercepts, compound-symmetry model. It is worth stressing again that not all
the models can be fitted in MLwiN. The MCMC estimation does not currently allow the
Toeplitz model, nor any user-defined constraints (that may have been set by the RCON
command) beyond a menu of alternatives; the SETD command cannot be used with MCMC
estimation but only with the IGLS and RIGLS procedures.
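The block-diagonal assembly described above is easy to mimic outside MLwiN; a hedged SciPy sketch (with invented covariance values) that stacks one child's block for three children:

import numpy as np
from scipy.linalg import block_diag

V = np.array([[9.0, 5.0, 4.0, 3.0],
              [5.0, 9.5, 5.5, 4.5],
              [4.0, 5.5, 10.0, 6.0],
              [3.0, 4.5, 6.0, 10.5]])   # one child's 4 occasions (illustrative numbers)
omega = block_diag(V, V, V)             # three children: zeros link different children
print(omega.shape)                      # (12, 12)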
Estimating a two and three level random intercepts and slopes model
This section details how to restructure the data from wide to long, how to check the quality
of the data, and then fits a series of models of increasing complexity. The approach to
model fitting that we have adopted is a hypothesis-informed, bottom-up, exploratory
approach (in the manner of Chapter 11). We want to know the general trajectory of
development, whether the form is linear or non-linear and the extent that boys and girls
living in urban and rural areas have differential development trends. This means that the
initial focus is on a simple random part with an elaboration of the fixed part model to derive
a concise formulation of the mean effects. Attention then focuses on developing the
random part to include higher-order random effects at the school level and a search for the
most appropriate dependency structure. In practice it is difficult to know the precise form of
the dependency so we have to try out a number of different forms to obtain the best fit and
to see the extent to which this makes a difference to the other estimates. There is
considerable tension in this section between what you would do in a real investigation and what
we have done here for pedagogic purposes, in that we have tried to show a great variety of
models that could be needed for growth modelling in general but are not needed for the
specific data we are modelling.
Singer and Willett (2003, 75) recommend always fitting two initial models: the
unconditional means model and the unconditional growth model. The former is simply a
variance-components model without any predictors and simply calculates an overall mean
and partitions the variance into between- and within-child. The latter then includes some
predictor for time, which in our case would be a function of age. These are two useful
baselines, but we must remember that as we add fixed-part predictors (and this includes
polynomials and interactions) the nature of the random part can change quite markedly.
Thus, as we shall see, apparent increasing heterogeneity between children in their growth
trajectories may in reality be due to growing mean differences by gender. It is thus
important to model the fixed part well before considering complex random parts. We do
not recommend some exhaustive machine-based automatic procedure, as the machine will
not know your research questions and the literature on which they are based.
Structuring the data from wide to long
The original data are given in wide format with columns giving the Endurance distances at
different occasions and the rows giving the individual children.
- File on main menu
- Open worksheet
- Filename: Endurancewide.wsz
This will bring up the Names window, which will give a summary of the data. There are four blocks of variables:
 Endurance measures: the distance covered by each child in hundreds of metres on 5
separate occasions; we have chosen hundreds of metres as the scale of
measurement as this will result in sensible estimates for the coefficients; use of
metres or kilometres would have produced very small or large absolute values that
may have hindered convergence. Notice that there are some missing values for the
four later measurement occasions, but all are measured in the first year of
recruitment;
 Age variables: age of the child in decimal chronological years at each planned
measurement occasion;
 Occasion indicators: the calendar year of the planned measurement occasion;
 Child variables: these include a Child and School identifier, a Sex and Rurality dummy
and a set of 1’s, the Constant
It is helpful to have a look at the original data; in the Names window
- Highlight all 20 variables
- Data in toolbar and click on View
to get the following data extract:
The missing values are clearly seen as is the categorical nature of Sex and Rurality; it is a
good idea to use a different label for categorical outcomes than that used for the column
name (thus Sex as column and Boy/Girl as label) as MLwiN will use the label to name any
columns of dummy variables created from the categories; otherwise there could be a
naming conflict.
To fit a simple repeated measures model in MLwiN we have to turn these data into a
long format where the rows represent different occasions. We also have to deal differently
with the occasion-varying variables (Endurance, Age, Occasion) and the occasion-invariant
variables (Sex, Rurality, the Constant, and the Child and School identifiers).
- Data manipulation on Main menu
- Split records (this restructures data from wide to long)
- Number of occasions: set to 5, that is the number of occasions
- Number of variables: set to 3, as there are 3 occasion-varying variables
- Specify End1 for Occasion 1, End2 for Occasion 2, and so on, for Variable 1; then Age1 to Age5 for Variable 2, and Occ1 to Occ5 for Variable 3
- Stack the repeated measures into the empty columns c22, c23 and c24; this will give the interleaved data
- In the Repeat(carried) data, highlight the occasion-invariant variables Child, Sex, Rurality, Constant and School, choosing the Output columns to be the Same as input
- Tick on Generate indicator column and choose to store in the empty column c21; this will give you the occasion number of the 5 occasions
- Split (No to Save worksheet)
- Name the columns in the Names window as follows (use different names to those of existing columns): C21: Occasion; C22: Endur; C23: Age; C24: Year

After this naming, the Names window should look as follows (we have also added a description to the new variables), and you can see that the columns now consist of 1500 observations (300 respondents by 5 occasions).
Save the worksheet as something like EndurLong.wsz.
The correct structure of Occasions within Child within School has now to be imposed on the
data, and this is done by sorting on the 3 key variables and carrying the rest of the data (that
is all the variables of length 1500 must be involved in this process).
- Data Manipulation on main menu
- Sort
- Number of key codes: 3, with School as highest, Child as middle and Occasion as lowest
- Select all the ‘long’ variables (Child to Year)
- Choose Same as input to put back the sorted data into the original columns
- Add to Action List
- Execute
- Highlight these long columns in the Names window and View them
It can be seen that Child 1 attends School 1; he is a Boy living in a Rural area, and he was aged
7.3 years in 1994 at Occasion 1. There are no missing Endurance values for him and he
covered 19.1, 26.0, 26.8, 26.0 and 30.3 hundreds of metres in successive 12-minute run
tests.
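For readers who prepare data outside MLwiN, the same wide-to-long restructure and sort can be sketched with pandas; the file name is hypothetical and the column names follow the worksheet (the long columns would still need renaming to Endur etc. afterwards):

import pandas as pd

wide = pd.read_csv("endurance_wide.csv")   # hypothetical export of the wide worksheet
long = pd.wide_to_long(wide, stubnames=["End", "Age", "Occ"],
                       i="Child", j="Occasion").reset_index()
# Occasion-invariant columns (School, Sex, Rurality, Constant) are carried automatically;
# impose the occasions-within-children-within-schools structure by sorting
long = long.sort_values(["School", "Child", "Occasion"])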
Checking the data: graphing and tabulating
Before modelling the data, it is sensible to get some feel for the values involved by carrying
out a descriptive analysis. To plot a histogram of the Endurance, the sequence is:
- Graphs on Main menu
- Customised graphs
- Choose Endur for Y on the plot what? tab
- Choose Plot type to be a Histogram
- Apply
To get the following graph, once labelled
There is some positive skew but remember the required assumption is only for approximate
Normality conditioning on the fixed part of the model; we can check this after model
estimation.
To obtain a real feel for the variation in the data, plot a line for each child showing their
development over time:
- Graphs on Main menu
- Customised graphs
- Choose Endur for Y on the plot what? tab
- Choose Age for X
- Choose Plot type to be a Line
- Choose Group to be Child
- Apply
To get the following graph, once labelled
While there is some general increase in the Endurance with age, there is also substantial
between-child variation.
Question 1: obtain a plot of Endurance against Age separately for Boys and Girls
(hint: use Col codes on the graph). What do you find?44
___________________________________________________________________________
To obtain the averages for Boys and Girls by the occasion of measurement
- Basic statistics on Main menu
- Tabulate
- Click on Means
- Choose the Variate column to be Endur
- Columns of the cross-tabulation to be Sex
- Tick on Rows of the cross-tabulation to be Occasion
- Tabulate
44 Answers are at the end of this chapter.
The results will appear in the output window from where they can be copied to a word
processor:
Variable tabulated is Endur
Columns are levels of Sex
Rows are levels of Occasion

| Occasion | | Boy | Girl | TOTALS |
|---|---|---|---|---|
| 1 | N | 149 | 151 | 300 |
| | MEANS | 16.6 | 13.8 | 15.2 |
| | SD'S | 3.74 | 3.09 | 3.42 |
| 2 | N | 140 | 144 | 284 |
| | MEANS | 18.2 | 15.3 | 16.7 |
| | SD'S | 4.15 | 3.35 | 3.77 |
| 3 | N | 139 | 135 | 274 |
| | MEANS | 20.0 | 16.4 | 18.2 |
| | SD'S | 3.99 | 3.20 | 3.62 |
| 4 | N | 135 | 141 | 276 |
| | MEANS | 21.3 | 16.4 | 18.8 |
| | SD'S | 4.22 | 3.37 | 3.81 |
| 5 | N | 141 | 146 | 287 |
| | MEANS | 23.4 | 17.3 | 20.3 |
| | SD'S | 3.96 | 3.15 | 3.57 |
| TOTALS | N | 704 | 717 | 1421 |
| | MEANS | 19.9 | 15.8 | 17.8 |
| | SD'S | 4.01 | 3.23 | 3.64 |
It can be seen that the data are well balanced in terms of gender with approximately the
same number of Boys and Girls on each occasion. Boys consistently have a higher mean
than Girls and there is an increasing distance run with each occasion that is more marked for
Boys than Girls.
Question 2: use the Tabulate command to examine the means for Endurance for the
cross-tabulation of Sex and Rurality, what do you find?
___________________________________________________________________________
It is worth examining the missingness in terms of the other observed characteristics. To do
this we first have to recode the Endurance values into Missing or Not.
- Data Manipulation on main menu
- Recode by Range (as Endur is a continuous variable)
- Values in range Missing to Missing to New Value 1 (just type in M)
- Values in range 0 to 100 to New Value 0 (a range encompassing all the observed data)
- Input column: Endur
- Free columns
- Add to Action List
- Execute
In the Names window, re-name c25 to be Missing and toggle categorical giving the labels
Not to 0 and Yes to 1. Then use tabulate to examine how this missingness relates to Sex45
Columns are levels of Missing
Rows are levels of Sex

| | | Not | Yes | TOTALS |
|---|---|---|---|---|
| Boy | N | 704 | 41 | 745 |
| | ROW % | 94.5 | 5.5 | 100.0 |
| Girl | N | 717 | 38 | 755 |
| | ROW % | 95.0 | 5.0 | 100.0 |
| TOTALS | N | 1421 | 79 | 1500 |
| | ROW % | 94.7 | 5.3 | 100.0 |
Some 5% of both Boys and Girls are missing, so there does not appear to be a selective
process in terms of Sex.
Question 3: are there any distinctive patterns of missingness in terms of Rurality
and Occasion?
___________________________________________________________________________
45 June 2011: I had to use a workaround so that the Variate in the previous Tabulate, where Means were chosen, was replaced by Missing; or try closing and opening the Tabulate window.
Listwise Deletion
As one of the procedures that we will use later (the SETD command) only operates with
non-missing data for the dependent variable, we will remove the missing values from the
dataset for the long vectors using the Listwise delete facility. As we will be operating on the
long vectors, this means that only missing data from specific occasions will be dropped. Thus
we are retaining all the information that we have.
- Data Manipulation on main menu
- Listwise
- Listwise delete on value Missing (just type in M)
- Input columns: highlight all the ‘long’ variables (Child to Year)
- Choose Same as input to put back the non-deleted data into the original columns
- Add to Action List
- Execute
so that the data in the View window now looks like
There are now only 1421 observations (instead of 1500) and you can see that the missing
responses (eg for Child 3 on Occasion 3) have been deleted.
The two-level null model
To set this up in MLwiN:
- From the Model menu, select Equations
- Click Notation at the bottom of the Equations window and ensure that only the General box and Multiple subscripts box are ticked
- In the Equations window, click on y and select Endur from the drop-down list
- Click on N levels and select 2-ij to specify a 2-level model
- For level 2(j) select Child
- For level 1(i) select Occasion
- Click Done
- Click on the $\beta_0 x_0$ term, which is currently shown in red (signifying that it is unspecified), choose Cons from the drop-down list, click on j(Child), i(Occasion), followed by Done
- Click the + button at the bottom of the Equations window twice to get the full specification

The model represents a null, empty or unconditional 2-level random-intercepts model.
Before proceeding to estimate the model it is a good idea to check that the data structure is as
expected:
- Model on Main Menu
- Hierarchy viewer
to get the following results:
There are, as expected, 300 children with up to 5 measurement occasions nested within
each child. The missing observations for the dependent variable are shown by fewer than 5
observations.46
To estimate the model:
- Click Start on the main menu to begin the estimation from scratch
- Click Estimates at the bottom of the Equations window twice to see the parameter estimates; the green colour signifies that the model has converged
- You are strongly encouraged to Save the models as we proceed under a unique, distinctive and memorable filename; when you do so, the current model, estimates, any graphical commands and the current worksheet are stored in a ‘zipped’ format which can subsequently be Opened, but only by MLwiN.
46 Now that the structure has been defined we can give the following commands in the Command Window to see the pattern of this missingness:

MLCOunt 'Child' c26   [counts the number of observations in each child block, stored as long in c26]
TAKE 'Child' 'c26' 'c26'   [takes the first count for each Child block and puts it in c26, now short]
TABUlate 1 'c26'   [tabulates c26 short; 1 requests row percentages]

| | 3 | 4 | 5 | TOTALS |
|---|---|---|---|---|
| N | 10 | 59 | 231 | 300 |
| % | 3.3 | 19.7 | 77.0 | 100.0 |

There are only 10 children with two observations missing, and none missing more than two.
This model gives us the grand mean endurance of 17.83 hundreds of metres for all
children and all occasions, irrespective of Sex, Age and Rurality. The between-child variance
at 11.699 is large in absolute terms and in comparison to its standard error, and is indeed
larger than the between-occasion, within-child variance, which is 9.112. This substantial
difference between children suggests a great deal of similarity within them, and we can readily
calculate the degree of similarity for this random-intercepts model:

$$\rho = \frac{\sigma^2_{u0}}{\sigma^2_{u0} + \sigma^2_{e0}} \qquad (28)$$
Using the Command Interface (accessed through Data Manipulation), we can calculate this
to be
calc b1 = 11.699/ ( 11.699 + 9.112)
0.56215
Some 56% of the variance in Endurance lies at the child level, or equivalently if we pick any
pair of Endurance measures on different occasions for the same child, we have an
expectation that they will be correlated with a value of 0.56. Currently, we are not allowing
different correlations between different pairs of occasions as this is a compound symmetry
model. At the bottom of the Equations window, Store these results as Model 1:Null.
Elaborating the fixed part of the model
Polynomial terms for Age
We are now going to fit a sequence of growth models and compare them. The next model
includes continuous Age as a linear main effect centred around its grand mean.47
- Equations window
- Click Add term at the bottom of the window
- Choose variable to be Age
- Choose Centring on the grand mean (to avoid convergence problems and to give a meaningful interpretation to the estimate associated with the intercept)
- Done
- More to convergence (use More as the estimates are likely to have changed little)
- Store on bottom tool bar, naming the model 2+age
On convergence the estimates are as follows
47 If you only have discrete time, such as the calendar year in which the measurement was taken, you are well advised to use orthogonal polynomials, which make for more stable estimation. You are also advised to think about the scale of the time metric so that the associated parameter estimates do not become too large or small. Thus, Verbeke and Molenberghs (2000, 50) specified their time variable as decades rather than years in their analysis of prostate cancer data; when the latter was used, the model (estimated in SAS) failed to converge.
Question 4: why does Age-gm have a ij subscript and not a j subscript?
The new term means that Endurance increases on average by 0.629 hundreds of
metres as children (Boys and Girls, Urban and Rural) grow older by one year. The average
distance covered by a child of average age (11.4 years; use Average under Basic Statistics
to get this value) is 17.838 hundreds of metres. Comparing these results with the previous
unconditional model shows that the between-child variance has gone up, but only trivially so
in comparison to the standard error, while the between-occasion variance has come down
substantially in comparison to its standard error. Taking account of their age, children are
very similar over occasions, and the degree of dependency in this linear growth model is
given by:

calc b1 = 12.311/ ( 12.311 + 5.088)
0.70757

There is clearly a great deal of dependency over time. Development researchers call this
tracking, as each child follows their own path in a dependent way. The deviance has also
dropped substantially, suggesting a better fit of the model.
We can now see if this growth is linear or non-linear, beginning with a quadratic model
- Equations window
- Click on Age-GM and choose Modify term
- In the pop-up window, tick on Polynomial and select Poly degree to be 2
- Done
- Start to convergence to get the following estimates
There is some evidence of a decelerating positive slope, as there is a positive
coefficient for the linear slope for Age and a negative coefficient for Age² (although the latter is not
very greatly bigger than its standard error). Store this model as 3+AgeSq.
Question 5: repeat the above procedure to see if a cubic term is necessary
___________________________________________________________________________
Gender effects
The next research question is whether boys are different from girls, so the next model
includes Sex as a main effect with Boy as the reference category
- Equations window
- Click Add term at the bottom of the window
- Choose variable to be Sex
- Reference category: choose Boy, to create a single dummy variable called Girl
- Done
- More to convergence to give the following results
Question 6: why does Girl have a j subscript and not an ij subscript?
___________________________________________________________________________
The average-aged Boy covers a distance of 20.038 while the average Girl is substantially and
significantly lower, with a mean of 20.038 − 4.086 = 15.952. Store the estimates from this model as
4+Sex.
Gender and Age interactions
We have found that on average boys have greater endurance than girls, we now want to
know if their differential trajectories diverge as they grow older. Consequently, we next try
to estimate differential polynomial growth trajectories for Boys and Girls
- Equations window
- Click Add term at the bottom of the window
- Order 1, so as to be able to specify an interaction between Age and Sex
- Choose the top variable to be Age; you will automatically get a 2nd-order polynomial
- Choose the bottom variable to be Sex; you will automatically get the reference category of Boy
- Done
- More to convergence
An informal inspection shows that there is a slightly accelerating positive slope for Boys’
age (the quadratic term is not large in comparison to the standard error) while both the
differential linear and quadratic slopes for Girls are large in comparison to their standard
errors. Thus, the predictive equations are:

Predicted Endurance = 19.815 + 0.856*Age-GM + 0.010*(Age-GM)² for Boys

Predicted Endurance = (19.815 − 3.596) + (0.856 − 0.443)*Age-GM + (0.010 − 0.059)*(Age-GM)² for Girls, that is

Predicted Endurance = 16.22 + 0.413*Age-GM − 0.049*(Age-GM)² for Girls.
We can perform a set of Wald tests on the linear and quadratic slopes, the differential Girl
intercept and the differential linear and quadratic slopes for Girls. We are not really
interested in whether the average-aged Boy term is different from zero, so there are 5 terms
we wish to test. Each test is of the form:

$$H_0: \beta_m = 0$$

the null hypothesis that the population term associated with variable m is zero.

- Model on Main menu
- Intervals and Tests
- Choose Fixed part with 5 functions; this gives the number of tests to be carried out
- Placing a 1 in a column means that the associated estimate is involved in the testing; a 0.00 in the row labelled Constant(k) means that we are testing against a null hypothesis of 0. Thus, in the first column, the function result (f) gives the result of (1 × $\beta_{Age}$), while f−k gives how far the result is from the null hypothesis of zero. The chi-square value in the next line gives the result of the test with 1 degree of freedom. At the bottom of the results table is the chi-square value for the joint test that all 5 parameters are zero.
Inspecting the results we see that all the individual coefficients have large chi-square values
with the exception of the quadratic slope for Boys. We can calculate
the p values associated with the smallest and second smallest chi-squares, each with 1 degree of
freedom, by entering the following commands:

cpro 14.073 1
0.00017585

so the differential quadratic term for Girls is highly significant;

cpro 0.836 1
0.36054

while the quadratic term for Boys is not significant by conventional standards.
However, we would normally retain the Boys’ quadratic term in the model because a ‘higher-order
term’, the differential quadratic term for Girls, is needed, and that term is a contrast to the Boys’
quadratic term.48 We can also test whether this full model is a significant improvement over
the empty model:

cpro 1296.512 5
6.4746e-284

with 5 degrees of freedom.
A very small p value is found but it is known that the log-likelihood test has better
properties for model comparison and we can compare all the models we have fitted and
look at the change in the deviance between the first and fifth models.
- Model on Main window
- Manage stored models
- Compare models 1 and 5
| | 1 Null | S.E. | 5 +poly(Age,2)*Girl | S.E. |
|---|---|---|---|---|
| Response | Endur | | Endur | |
| Fixed Part | | | | |
| Constant | 17.830 | 0.213 | 19.815 | 0.264 |
| (Age-gm)^1 | | | 0.856 | 0.028 |
| (Age-gm)^2 | | | 0.010 | 0.011 |
| Girl | | | -3.597 | 0.373 |
| (Age-gm)^1.Girl | | | -0.443 | 0.040 |
| (Age-gm)^2.Girl | | | -0.059 | 0.016 |
| Random Part, Level Child: Constant/Constant | 11.699 | 1.117 | 8.155 | 0.746 |
| Random Part, Level Occasion: Constant/Constant | 9.112 | 0.385 | 4.503 | 0.190 |
| -2*loglikelihood | 7758.281 | | 6847.129 | |
| Units: Child | 300 | | 300 | |
| Units: Occasion | 1421 | | 1421 | |
48 It is well known that lower-order terms should be included if higher-order ones are in the fixed part of the model, a requirement that Peixoto (1990) calls a ‘well-formulated’ specification. This applies to both interactions and higher-order polynomials. Morrell et al (1997) have generalised this to random-effects models, so that if we want to include quadratic random time effects we must also include linear random effects and random intercepts. Morrell, C.H., Pearson, J.D. and Brant, L.J. (1997) Linear transformations of linear mixed-effects models, The American Statistician, 51, 338-343; Peixoto, J.L. (1990) A property of well-formulated polynomial regression models, The American Statistician, 44(1), 26-30. See also Braumoeller, B.F. (2004) Hypothesis testing and multiplicative interaction terms, International Organization, 58, 807-820.
The change in the deviance is 7758.281 – 6847.19, and we can calculate the associated p
value for this difference treating it as a chi-square distribution with 5 degrees of freedom
(the difference in the number of parameters)
calc b1 = 7758.281 - 6847.19
911.09
cpro b1 5
3.8088e-200
A huge difference and a very small p, but the difference is considerably less than that
suggested by the more approximate Wald test.
The best way of appreciating the nature and the size of effects with polynomials and
interactions is a plot of predictions and their confidence intervals. The easiest way of doing
this is through the Customised predictions facility, which in its setup phase requires a 'wish
list' of values for the predictor variables.

Model on Main menu
Customised predictions

In the Setup window, we have to consider each of the variables in turn and choose the
values that we want predictions of Endurance for; there is no need to consider polynomials
and interactions, as those specified in the Equations window are automatically included.
Here, we specify Age from 7 to 17 for both Boys and Girls.

Click on Sex
Change Range
Select Mean not Category
Tick on Boy and Girl
Done
Leave Constant at the value 1
Click on Age
Change Range
Select Range
Upper bound 17
Lower 7
Increment 1
Done
Click on Fill Grid [to get the values for the predictor variables you specified]
Click on Predict Grid [to get the predicted values and CI's you specified, here 95%]
Move to the Predictions sub-window, which gives the following results
The first 3 columns on the left give the variables and the values we have requested - the
'wish list'. The three remaining columns give the mean predicted Endurance with the 95%
lower and upper confidence intervals.49 All 5 of these columns are named and stored in the
worksheet, and the values can be copied to a table in Word.50
To display the results graphically

In the Predictions sub-window
Click on Plot Grid
Choose Y axis to be mean.pred [the predicted values]
Choose X axis to be Age.pred [which contains the values 7 to 17]
Choose Grouped by to be Sex.pred [to plot a line for each Boy/Girl group]
Choose Graph Display to be 1 [ie D1, existing graphs will be overwritten]
Tick for 95% confidence intervals, displayed as lines as the X is continuous
Apply
After some re-labelling we get a very clear graph of the results (choosing Scale for X to be
user defined and set to 7 to 17 with 10 ticks, ticking off the Show margin titles, and
requesting that the labels for the Group - Boy/Girl - are plotted on the graph).
49 Other confidence interval ranges could have been specified in the Set-up window.
50 These predictions are not in fact calculations (which they could be in this case) but simulations; this
approach allows great flexibility in non-linear models, and the precision of the simulation can be increased by
simply increasing the number of requested simulations. Very importantly, they fully take account of the
standard errors of the estimated coefficients, including the covariances.
The mean endurance of Boys and Girls increases with age, Boys have greater endurance
throughout the 7 to 17 age range, and the difference between Boys and Girls increases with
Age. The development for Boys between 7 and 17 looks to be largely linear, while Girls show
a decelerating curve with not much development beyond 15 or so years of age. The
inflection point for Girls can be calculated by using equation (21). The overall equation for
Girls is

Predicted Endurance = 16.22 + (0.413)*(Age-GM) + (-0.049)*(Age-GM)² for Girls

so the turning point is

Calc b1 = -0.413/(2*(-0.049))
4.2143

and to this we need to add the average age of 11.4 to get

calc b1 = b1 + 11.4
15.614

Maximum endurance is reached for Girls at around 15-16, while Boys continue to develop
beyond that age. It is also worth stressing that as this is a panel and not a cross-sectional
study, with the same children followed over time, the evidence for genuine development of
endurance (and not some artefact) is strong.
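As a quick check of this arithmetic, a minimal sketch in Python (the vertex of a + b*t + c*t² lies at t = -b/(2c); the values are the Girls' estimates quoted above):

# Turning point of the Girls' quadratic growth curve
b, c = 0.413, -0.049
turn = -b / (2 * c)        # 4.2143, in grand-mean-centred years
print(turn + 11.4)         # about 15.6 years of age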
Question 7: repeat the above procedure, building on Model 5, to see if a Rural
main effect is needed and if there is an interaction between Rurality and Age –
choose Urban as the base
___________________________________________________________________________
Rural Gender Interactions
In completing question 7 you will have found that there is a main effect for Rurality but not
a cross-level interaction effect with Age. Rurality has conferred some benefit in endurance
that is maintained throughout childhood and adolescence. The next research question is
whether this benefit applies equally to boys and girls - the idea being that boys rather than
girls may be expected in traditional farming communities to engage in more demanding
physical work and therefore have greater endurance. This is also the final elaboration of the
fixed part that is possible with the available data.
The current model with a main rural effect and an age-by-sex interaction is
To add the Rural by Sex interaction

Equations window
Click Add term at the bottom of the window
Order 1 so as to be able to specify an interaction between Sex and Rurality
Choose the top variable to be Sex; you will automatically get the reference category of Boys
Choose the bottom variable to be Rural; you will automatically get the reference category of Urban
Done
More to convergence

To get the following results, which you can Store as 8:Sex*Rural
A boy of average age who lives in a Rural area has an endurance that is some 300 metres
longer, while for a girl the rural advantage is some 200 metres less than this. We can use the
Wald test to evaluate three null hypotheses:

H0: β_Rural = 0                   the null hypothesis that the population term associated with
                                  Rurality is zero; because of the interaction with Sex, this is a
                                  test of the difference between Rural and Urban Boys.
H0: β_Girl.Rural = 0              a test of the differential difference between Rural Boys and
                                  Girls.
H0: β_Rural + β_Girl.Rural = 0    a test of the difference between Rural and Urban Girls.

We use the intervals and tests procedure, specifying 3 tests of the fixed part of the model.
To test the third hypothesis we have included two 1's in the third column to get the function
result of (1 * 3.056) + (1 * -2.044), which equals 1.012, that is the Rural-Urban difference for
Girls. All three effects are significant at conventional levels, so that Rural Boys have greater
endurance than Urban Boys, Rural Boys have greater endurance than Rural Girls, and Rural
Girls have greater endurance than Urban Girls. Again the best way to appreciate the size of
the effects is a customised predictions plot where the 'wish list' consists of 2 Sex categories,
2 Rurality categories and 11 ages from 7 to 17.
First, a plot that emphasises the Sex differences by grouping on Sex within the same graph
but trellising into different columns, with a separate graph for the Rural and for the Urban
children.

Second, a plot that emphasises the Rural/Urban differences by grouping by Urban/Rural on
the same graph and trellising by Sex.
We already know that the Rural-Urban difference in general (that is, for both sexes) does
not change with Age; we could fit a model that allows this differential to change with Age
differentially by Sex. This requires us to add a first-order interaction between Rurality and
Age (which we had previously removed) and a second-order interaction between Age,
Gender and Rurality. Fortunately for parsimony and our sanity, the effects get little
empirical support: if we do a likelihood ratio test against the previous model, the difference
of 6802.099 - 6799.196 with 4 degrees of freedom receives a highly insignificant p value of
0.57. A word of warning - it is all too easy to try out complex interactions and then justify
them post hoc by theoretical musings. This is equivalent to the Texas sharpshooter who first
fires at the barn door and then draws the target around where the shots have accidentally
concentrated! The most complex model that receives empirical support is Model 8 and that
now provides a base from which to consider elaborations of the random part of the model.
Interpreting and elaborating the random part
The random intercepts model
Staying within the framework of the two-level multilevel model, we can see how the random
part has changed after elaborating the fixed part of the model by comparing the original null
model with a model that has fixed terms for Age, Age², Sex, Age by Sex, Rurality, and Sex
by Rurality.
                      1 Null             8 +Sex*Rural
Response              Endur     S.E.     Endur     S.E.
Fixed Part
Constant              17.830    0.213    18.114    0.356
(Age-gm)^1                                0.856    0.028
(Age-gm)^2                                0.010    0.011
Girl                                     -2.473    0.507
(Age-gm)^1.Girl                          -0.444    0.040
(Age-gm)^2.Girl                          -0.059    0.016
Rural                                     3.056    0.462
Girl.Rural                               -2.044    0.652
Random Part
Level: Child
Constant/Constant     11.699    1.117     6.885    0.642
Level: Occasion
Constant/Constant     9.112     0.385     4.503    0.190
-2*loglikelihood:     7758.281            6802.099
Both the between-child and the between-occasion variance have come down as the fixed
part has been elaborated. The total unexplained variation was 11.699 + 9.112 = 20.811 and
is now 6.885 + 4.503 = 11.388, so that the fixed-part elaborations accounted for (20.811 -
11.388)/20.811, or some 45%, of the original variation. The majority of the remaining
residual variation lies at the child level, 6.885/(6.885 + 4.503), so that some 60% of the
variation is between children. In this conditional compound symmetry model there is
considerable similarity or dependence over occasions, with endurance being auto-correlated
with a value of 0.6 over occasions.
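These proportions are simple ratios of the variance estimates above; a minimal sketch in Python:

# Variance accounted for by the fixed-part elaborations, and the VPC
null_total = 11.699 + 9.112           # 20.811
model8_total = 6.885 + 4.503          # 11.388
print(1 - model8_total / null_total)  # ~0.45 of the original variation explained
print(6.885 / model8_total)           # ~0.60: child-level share / autocorrelation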
An informal check of the level-2 residuals can be obtained by plotting a Normal plot of the
child-level residuals

Model on Main Menu
Residuals
Settings
Level 2: Child
Calc
Plots
Standardised Residuals by Normal Scores
Graph to D10
Apply
The residuals plot very much as a straight line, suggesting that the shrunken residuals are not
markedly non-Normal and that there are no substantial child outliers.51

51 Verbeke and Molenberghs (2000, 83-92) suggest that this procedure should be treated with some
scepticism, as shrinkage would result in the estimated residuals being made more Normal when the true
random effects are not. They suggest that the only effective way to assess this is to fit more complex models
with relaxed distributional assumptions. Verbeke, G. and Lesaffre, E. (1996) 'A linear mixed effects model with
heterogeneity in the random effects population', Journal of the American Statistical Association, 91, 217-221.

Question 8: what would happen to the standard errors of the fixed part if it was
assumed there was no dependence?

There are two ways we can characterise the scale of the between-children differences.
The first uses the coverage facility in the Customised predictions window. The second uses
the Predictions window to make a plot of the modelled estimates of the children's growth,
which have been purged of their occasion-level idiosyncrasies and/or measurement error.
We will start with the second method and plot the growth curves for Boys of different ages
who live in urban areas. (As the model has the same variance for all types of children, plots
for other groups will only show a shift up or down and not greater or less variation.)

Model on Main Menu
Predictions
Fixed: click on the terms associated with the Constant, and the linear and quadratic Age terms; leave other terms greyed out
Random level 2: click on the random intercept, u0j
Random Level 1: leave the level 1 term greyed out
Output from prediction to c50 [an empty column]
Calc
To obtain a plot of the modelled growth trajectories

Graphs on Main menu
Customised graphs
Choose c50 for Y on the 'plot what?' tab
Choose Age for X
Choose Plot type to be Line
Choose Group to be Child
Apply
The scale of the between-child heterogeneity is apparent - even when we have modelled
out Sex and Rurality differences there are considerable differences between children of the
same age. The parallel lines are of course a consequence of the random-intercepts
assumption that the variance does not change with Age.
Turning now to the other approach, we can calculate the coverage interval for the
average child

Model on Main menu
Customised predictions
Setup sub-window
Change the range of each variable to get the 'average' mean values
Tick on Level 2 (Child) Coverage and choose 95% coverage interval
Fill Grid
Predict
Switch to the Predictions sub-window
To get the following results

The typical child covers 18.0 hundreds of metres in 12 minutes, and the 95% confidence
interval around this average runs from 17.67 to 18.25. In the population, the child at the
best 2.5% of all endurance distances will cover 23.32, while the child in the poorest 2.5%
will only cover 12.77. Once again individual child variability is seen to be substantial;
children are very variable in their endurance.
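MLwiN derives this coverage interval by simulation, but with a Normal random intercept a rough analytic check is the mean plus or minus 1.96 child-level standard deviations; a minimal sketch in Python, using the estimates quoted above:

# Approximate 95% coverage interval for the average child
import math
mean, var_child = 18.0, 6.885
half = 1.96 * math.sqrt(var_child)
print(mean - half, mean + half)   # roughly 12.9 to 23.1, close to the simulated 12.77 and 23.32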
The random slopes model with complex heterogeneity between occasions
All the models so far have assumed constant, unchanging between-child and within-child
variation as the children age. We can allow for more complex heterogeneity by allowing the
linear term for Age to have a child differential slope and the level-1 residuals also to depend
on linear Age. This will allow the different characteristic patterns that were illustrated in
Figures 5 and 6. The modified equation looks like

This on convergence gives the following.
We can use a Wald test to evaluate the approximate significance of the four new terms in
the random part of the model
There is some evidence that the 'linear' part of the variance function is worth investigating
at both levels. The positive covariance at the child level suggests that differences between
children are growing with age, while the negative covariance at level 1 suggests that
volatility around an individual child's growth curve is decreasing with age.52
We can use the Variance function window to store the variance function at each
level and use the Calculate window to compute the VPC as the ratio of level 2 to level 1 +
level 2, just as we did in Chapter 5 of Volume 1. We have named some free columns as
Lev2Var, Lev1Var and VPC, and Lev1VarCI and Lev2VarCI to hold the 95% confidence
intervals. First for level 1

52 There is nothing untoward about the 'variance' terms at level 1 being negative as long as the overall
variance function does not become negative.
And then for level 2

And then calculate the VPC

Graphing each function against Age with their confidence intervals (you must use the
Offsets capability on the graphs, not the Values option, as the output from the confidence
interval on the Variance function is expressed as departures from the general line), we get
the following graphs:
We see some evidence that the between-occasion volatility is decreasing with Age and the
between Child differences are increasing so that the VPC, which gives the proportion of the
variance at the child level, is also increasing. However, the intervals are quite wide enough
to ‘thread a straight line through’. Unfortunately it is only possible to calculate the
confidence bounds for the VPC in MCMC (see Chapter 12 of this volume: Functions of
parameters).
It is also possible to calculate the correlation between any pair of continuous-time
observations, using the formula given in Snijders and Bosker (1999, 172), such as between a
7.5 year old (t1) and a 17.5 year old (t2). As usual the correlation is the covariance divided by
the product of the square roots of the variances, so for time t1 and time t2 the correlation is

corr(t1, t2) = cov(t1, t2) / sqrt(var(t1) * var(t2))        (29)

The covariance is given by

cov(t1, t2) = σ²u0 + σu01*(t1 + t2) + σ²u1*(t1 * t2)        (30)

and the total variance at time t is given by the sum of the level-1 and level-2 quadratic
variance functions, so for t1 it is

var(t1) = σ²u0 + 2σu01*t1 + σ²u1*t1² + σ²e0 + 2σe01*t1 + σ²e1*t1²        (31)

It must of course be remembered that in this formulation of the random slopes model, Age,
which is our t variable, has been centred around its grand mean of 11.439, so that a 7.5 year
old equals -3.939 while the equivalent value for a 17.5 year old is 6.061. The covariance is
therefore given by

6.894 + 0.127*(-3.939 + 6.061) + 0.006*(-3.939 * 6.061) = 7.02025
The total variance at each of these two ages is

6.894 + 2*0.127*(-3.939) + 0.006*(-3.939)² + 4.585 + 2*(-0.082*(-3.939)) + (-0.018*(-3.939)²)

and

6.894 + 2*0.127*(6.061) + 0.006*(6.061)² + 4.585 + 2*(-0.082*(6.061)) + (-0.018*(6.061)²)

which give 11.3107 and 11.5837 respectively, so that the correlation is equal to
7.02025/(11.3107 * 11.5837)^0.5, that is 0.61. These results again suggest that there is quite
a high dependency even between endurance measurements taken 10 years apart. There is
thus strong evidence of tracking; individual children tend to follow a distinctive course. We
will in a later section consider more complex dependencies over time.
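The whole calculation can be scripted; a minimal sketch in Python using equations (29) to (31) and the estimates quoted above:

# Correlation between two ages under the random-slopes model
import math
su0, su01, su1 = 6.894, 0.127, 0.006     # level-2 (child) variance function
se0, se01, se1 = 4.585, -0.082, -0.018   # level-1 (occasion) variance function
gm = 11.439                              # grand mean of Age

def total_var(t):                        # equation (31)
    return su0 + 2*su01*t + su1*t**2 + se0 + 2*se01*t + se1*t**2

def child_cov(t1, t2):                   # equation (30)
    return su0 + su01*(t1 + t2) + su1*t1*t2

t1, t2 = 7.5 - gm, 17.5 - gm
print(child_cov(t1, t2) / math.sqrt(total_var(t1) * total_var(t2)))  # about 0.6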
Question 9: make a plot of the modelled growth trajectories for Urban Boys using
the above random slopes model; what does it show?
___________________________________________________________________________
Three level model: between schools variation
It is straightforward to extend our two-level model to a three-level one with Schools as the
higher, third level.

Equations on Main menu
Click on Endur_ij
N levels: choose 3-ijk
Lev3(k): choose Schools
Done
Click on Constant
Tick on k(Schools)
Done
This gives a random intercepts term at level 3 so that the specified model is
The hierarchy viewer shows that there are 10 children in each of 30 schools that have been
sampled.
After convergence the model estimates are as follows.
The change in the deviance is from 6793.492 to 6774.206, a sizeable improvement for a
single added term. Chapter 9 of Volume 1, however, warns us that in comparing two models
that differ in their random parts we should be using the REML/RIGLS estimates and not the
current FIML/IGLS ones. Using the Estimation Control button in the Main menu, change to
RIGLS and re-estimate the models. The change in the deviance from including School effects
is then from 6793.52 to 6774.26, so there is only a small departure from the maximum
likelihood IGLS results. We can calculate the associated p value, and because we are testing a
variance that cannot take values of less than zero (the null hypothesis) we can follow the
advice of Chapter 6 and halve the resultant p value.53
calc b1 = 6793.52 - 6774.26
19.260
cpro b1 1 b2
1.1407e-005
calc b2 = b2/2
5.7036e-006
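The same halved p value can be obtained outside MLwiN; a minimal sketch in Python, assuming scipy is available:

# Boundary test for a variance: halve the chi-square(1) upper-tail p value
from scipy.stats import chi2
lrt = 6793.52 - 6774.26            # 19.26
print(chi2.sf(lrt, df=1) / 2)      # ~5.7e-06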
If we believe in such a frequentist approach to probability there is evidence of significant
unexplained school effects. However this depends on treating 30 schools as sufficient to get
reliable results while some software (such as LME in R) does not even give the standard
error for the variance of higher-level random terms as the distribution of the estimator may
be quite asymmetric. Moreover while the REML/RIGLS estimates are an improvement in
taking account of the uncertainty of the fixed part in estimating the random part, the MCMC
estimates are needed to take account of uncertainty in all parameters simultaneously and
also allow for asymmetric credible intervals.
The two-level model was run in REML and in MCMC, and the three-level model was
also run with both types of estimation. Here are the comparisons when a burn-in of 500 has
been used and a monitoring run of 50k simulations. The two-level results are as follows.
Question 10: what do you conclude from the results given below; has the
monitoring chain been run for a sufficiently long length; are there differences
between the results obtained from the two types of estimation?

53 The logic is that with replicated data sets generated under the null hypothesis we would get positive values
half of the time and negative values the other half, but these would be estimated as positive half of the time
and zero the other half. Consequently, with random intercepts the sampling distribution is made up of a 50:50
mixture of a chi-square distribution with 1 degree of freedom and a spike at zero, and this can be
accommodated by halving the incorrect p value based on chi-square with 1 degree of freedom.
                     2-level RIGLS      2-level MCMC (50k monitoring run)
Response             Endur     S.E.     Endur     S.E.     Median    CI(2.5%)  CI(97.5%)  ESS
Fixed Part
Constant             18.10     0.358    18.093    0.357    18.093    17.388    18.786     4821
(Age-gm)             0.855     0.029    0.854     0.029    0.854     0.798     0.911      37969
(Age-gm)^2           0.008     0.011    0.008     0.011    0.008     -0.013    0.030      43002
Girl                 -2.489    0.509    -2.467    0.508    -2.465    -3.461    -1.483     3706
(Age-gm).Girl        -0.440    0.040    -0.439    0.041    -0.439    -0.519    -0.359     38507
(Age-gm)^2.Girl      -0.056    0.016    -0.056    0.016    -0.056    -0.086    -0.025     43614
Rural                3.086     0.463    3.100     0.461    3.100     2.192     4.018      3571
Girl.Rural           -2.055    0.653    -2.074    0.652    -2.071    -3.345    -0.798     3425
Random Part
Level: Child
Cons/Cons            6.998     0.651    7.070     0.662    7.037     5.875     8.477      29077
(Age-gm)/Cons        0.128     0.059    0.131     0.053    0.130     0.030     0.239      1100
(Age-gm)/(Age-gm)    0.007     0.013    0.008     0.005    0.007     0.002     0.020      470
Level: Occasion
Cons/Cons            4.584     0.308    4.603     0.322    4.596     3.997     5.261      1766
(Age-gm)/Cons        -0.082    0.038    -0.083    0.038    -0.082    -0.159    -0.009     5093
(Age-gm)/(Age-gm)    -0.016    0.033    -0.013    0.031    -0.013    -0.073    0.051      1483
Deviance:            6793.532
DIC:                                    6432.12
The results for the 3-level model (again with 50k simulations) are as follows
                     3-level RIGLS      3-level MCMC (50k monitoring run)
Response             Endur     S.E.     Endur     S.E.     Median    CI(2.5%)  CI(97.5%)  ESS
Fixed Part
Constant             18.112    0.400    18.112    0.404    18.110    17.326    18.912     3386
(Age-gm)             0.854     0.029    0.854     0.029    0.854     0.797     0.911      38403
(Age-gm)^2           0.008     0.011    0.008     0.011    0.008     -0.013    0.030      41415
Girl                 -2.419    0.489    -2.422    0.494    -2.425    -3.397    -1.456     4227
(Age-gm).Girl        -0.439    0.040    -0.439    0.041    -0.439    -0.519    -0.359     37934
(Age-gm)^2.Girl      -0.056    0.016    -0.056    0.016    -0.056    -0.087    -0.025     41315
Rural                3.048     0.448    3.054     0.448    3.054     2.171     3.926      3913
Girl.Rural           -2.117    0.629    -2.106    0.636    -2.106    -3.342    -0.856     3908
Random Part
Level: Schools
Constant/Constant    1.276     0.504    1.339     0.583    1.252     0.467     2.729      3812
Level: Child
Constant/Constant    5.732     0.577    5.829     0.597    5.794     4.755     7.096      18065
(Age-gm)/Cons        0.115     0.056    0.118     0.049    0.118     0.021     0.216      745
(Age-gm)/(Age-gm)    0.005     0.013    0.007     0.004    0.006     0.002     0.017      471
Level: Occasion
Constant/Constant    4.541     0.306    4.557     0.317    4.543     3.973     5.227      1617
(Age-gm)/Constant    -0.086    0.038    -0.087    0.039    -0.087    -0.165    -0.012     5617
(Age-gm)/(Age-gm)    -0.009    0.033    -0.006    0.031    -0.006    -0.066    0.055      1477
Deviance:            6774.3
DIC:                                    6429.809
Question 11: what do you conclude from these results; has the monitoring chain
been run for a sufficiently long length; are there differences between the results
obtained from the two types of estimation? Is the three-level model an
improvement on the two-level model?
There appears to be some evidence that there are differences between schools and we can
examine the ‘caterpillar’ plot to characterise the size and nature of these differences.
Model on Main menu
Residuals
In the Settings sub-window
Start output at column c300
1.42 SD (comparative) of residual to c301 [to compare pairs of schools]
Level 3: Schools
Calc
In the Plots sub-window
Choose residual +/- 1.42 sd x rank
Apply

After modifying the title and changing the scale on the horizontal axis, we obtain the
following plot.
The extremes add some 150 metres (Schools 15 and 19) and subtract some 210 metres
(Schools 20 and 6) from the overall average. These estimates are derived formulaically from
the between-school variance; we could also have stored the residuals during the MCMC
estimation and examined their distribution.54
We can also calculate the proportion of the variance that is due to differences
between schools. We do this using the MCMC estimates, as they have better properties, and
do so for the average-aged child (to avoid the complication of the random slopes at levels 1
and 2):

VPC(Schools) = σ²v0 / (σ²v0 + σ²u0 + σ²e0)        (32)

Calc b1 = 1.339/(1.339 + 5.829 + 4.557)
0.11420

Some 11 per cent of the variation lies at the school level; alternatively, randomly selected
pairs of pupils from the same school will be correlated 0.11. Unfortunately, we do not have
any variables measured at the school level (e.g. the physical exercise regime) so we are
unable to proceed to try and account for this variation.
A digression on orthogonal polynomials
We break off from our sequence of developing the model at this point for pedagogic
reasons. The models we have fitted so far have used continuous time in the form of age. It
is not uncommon, however, to have only discrete time, such as the calendar year or period
in which the measurement was taken. In effect the measurement is ordinal, with 5 being
later than 4, which is later than 3, and so on. When this is the case there are advantages in
using a transform of this discrete time, here the Occasion variable, in both the fixed and the
random part of the model; this transform is known as an orthogonal polynomial. The
transformation can be linear, quadratic, cubic and so on, thereby allowing a straight-line
relationship, a single curve, a curve with two bends, etc. The resultant sets of new variables
have a mean of zero and are orthogonal to each other, which makes for much more stable
estimation. Due to these properties they are highly recommended for the analysis of trends
in repeated measures models.55 Here we will pretend that we do not have data in
continuous time and delete the Age terms and all interactions involving Age to get the
following model (you will have to use RIGLS estimation and not MCMC to re-specify the
model).

54 This can only be done before the start of the MCMC procedure: Model on main menu; MCMC; Store
residuals. The resultant long vector will have to be 'unsplit' to get the estimates for each school; see Chapter 10.
55 Orthogonal polynomials are discussed in more detail in the MLwiN User Manual Supplement and in
Hedeker, D. and Gibbons, R.D. (2006) Longitudinal Data Analysis, New York: Wiley.
We will also clear the current model estimates before including the orthogonal polynomials.

Model on main menu
Manage stored models
Clear all

Then we will introduce the 1st order polynomial of Occasion

Model on main menu
Equations window
Click Add term at the bottom of the window
Choose variable to be Occasion
Tick on Orthogonal polynomial (this option is only permitted if the variable, as here, is categorical)
Choose degree to be 1 for a linear relation between endurance and occasion
Done
Start to convergence
Store on bottom tool bar, naming the model Orth1
For speed in this exploratory phase we will use the RIGLS estimates and not the MCMC
ones. We now modify the equations to the 2nd order polynomial, and then the 3rd order
and the 4th order, storing the estimates as we go. The 4th order polynomial is the most
complicated form that we can fit to these data with their 5 occasions; it is equivalent to
fitting a separate dummy variable for each occasion contrasted against a base category. We
hope that the model can be made more parsimonious than this, that is, we hope to find
insignificant higher-order polynomials. The orthogonal polynomial readily permits such
flexibility, whereas a dummy-variables approach for each and every occasion does not.
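Orthogonal polynomial codes are easy to generate outside MLwiN; a minimal sketch in Python using a QR decomposition of the Vandermonde matrix (the columns here are normalised to unit length across the 5 occasions, giving a population SD of sqrt(1/5) ≈ 0.45, close to the values reported below for the expanded data):

# Orthogonal polynomial contrasts for 5 equally spaced occasions
import numpy as np
occ = np.arange(1, 6, dtype=float)           # occasions 1 to 5
V = np.vander(occ, N=5, increasing=True)     # columns 1, t, t^2, t^3, t^4
Q, _ = np.linalg.qr(V)
poly = Q[:, 1:]                              # drop the constant column
print(np.round(poly.T @ poly, 10))           # identity matrix: columns are orthogonal
print(np.round(poly.mean(axis=0), 10))       # all column means are zero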
Equations window
Click on orthog_Occ^1 and choose to Modify term
Choose degree to be 2 for a quadratic relation between endurance and occasion
Done
More to convergence
Store on bottom tool bar, naming the model Orth2
Before proceeding to look at the estimates it is worth examining the orthogonal
polynomials themselves. We can first view them in relation to the untransformed Occasion
variable, and then calculate their means, SDs and correlations.

Means
orthog_Occ^1   -0.0075663
orthog_Occ^2    0.012413
orthog_Occ^3    0.00066945
orthog_Occ^4   -0.00076766

S.D.'s
orthog_Occ^1    0.45247
orthog_Occ^2    0.44861
orthog_Occ^3    0.44619
orthog_Occ^4    0.44191

Correlations
               orthog_Occ^1  orthog_Occ^2  orthog_Occ^3  orthog_Occ^4
orthog_Occ^1    1.0000
orthog_Occ^2   -0.0124        1.0000
orthog_Occ^3    0.0189       -0.0125        1.0000
orthog_Occ^4    0.0008        0.0143       -0.0104        1.0000
Question 12: what are the (approximate) characteristics of these orthogonal
polynomials?
Turning now to the 4 sets of estimates we can compare the models.
                    Orth1     S.E.    Orth2     S.E.    Orth3     S.E.    Orth4     S.E.
Fixed Part
Constant            18.216    0.402   18.217    0.402   18.220    0.402   18.220    0.402
Girl                -2.862    0.482   -2.860    0.482   -2.862    0.482   -2.861    0.482
Rural               3.031     0.460   3.033     0.460   3.029     0.460   3.027     0.459
Girl.Rural          -2.132    0.645   -2.133    0.645   -2.129    0.645   -2.128    0.645
orthog_Occ^1        3.937     0.133   3.933     0.133   3.926     0.133   3.925     0.133
orthog_Occ^2                          -0.281    0.134   -0.277    0.134   -0.280    0.134
orthog_Occ^3                                            0.302     0.135   0.303     0.135
orthog_Occ^4                                                              0.169     0.137
Random Part
Level: Schools
Constant/Constant   1.410     0.547   1.413     0.547   1.412     0.547   1.410     0.546
Level: Child
Constant/Constant   5.865     0.601   5.866     0.603   5.874     0.603   5.867     0.602
Level: Occasion
Constant/Constant   5.131     0.217   5.116     0.216   5.098     0.216   5.097     0.216
-2*loglikelihood:   6940.846          6936.471          6931.506          6929.978
Comparing the deviance of the models

Comparison          Dev1       Dev2       Diff in chi-square   Diff in df   p value
1st and 2nd order   6940.846   6936.471   4.375                1            0.036
2nd and 3rd order   6936.471   6931.506   4.965                1            0.026
3rd and 4th order   6931.506   6929.978   1.528                1            0.216
We see that by conventional standards the 2nd and 3rd order polynomials are needed but
not the fourth order. Looking at the specific estimates of the 4th order model (3.925, -0.280,
0.303 and 0.169), and remembering that the estimates are directly comparable (the
orthogonal variables have approximately the same standard deviation), we can see that the
linear term dominates. The easiest way to interpret the estimates is as a plot of the fixed-part
predictions. Here are the results for the 3rd order polynomial for urban boys plotted
against Year. The strong underlying linear increase is very evident.
The model can be developed in the same manner as for continuous time to include
interactions with Sex and Rurality and to include random slopes at each level.
Elaborating the random part of the model: accommodating temporal
dependence
Compound symmetry
The majority of our elaborations of the continuous-time model have concerned the fixed
part of the model, and with the exception of the random slopes model we have been
content to estimate a compound symmetry model in which there is only one variance; this
implies that the correlation between Endurance measurements for a child is constant,
unchanging with how far apart in time the measurements were taken. We now turn to the
random part and fit a number of different structures. As is common, we do not have a
strong theory for specifying the exact form of the residual dependency over time, so we will
fit a series of models of differing parsimony to assess empirically an appropriate form. We
are going to operate without a term for between-school differences and without any terms
for random slopes, as we want to concentrate on the nature of the dependency over
occasions. If we return to Model 8, with the complex fixed part involving the 2nd order
polynomial of time and the Sex and Rurality interactions, we get the following results using
RIGLS. We could of course compare these models with a random slopes model to see which
gives the better empirical fit.
This random intercepts model has a compound symmetry correlation structure, so that any
pair of occasions has the same degree of correlation. We have stored the estimates of this
model as CS. Using the Command Interface we can calculate the correlation to be
calc b1 = 6.987/ (6.987 + 4.519)
0.60725
and we can use this information to begin building a table of the degree of correlation for
each pair of occasions identifying different lags. In this model, the same correlation is
imposed at each and every lag.
Pair of Occasions   CS     Lag
1 and 2             0.61   1
2 and 3             0.61   1
3 and 4             0.61   1
4 and 5             0.61   1
2 and 4             0.61   2
3 and 5             0.61   2
1 and 4             0.61   3
2 and 5             0.61   3
1 and 5             0.61   4
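The implied correlation matrix can be written down directly; a minimal sketch in Python:

# Compound symmetry: the same correlation at every lag
import numpy as np
icc = 6.987 / (6.987 + 4.519)   # ~0.607
R = np.full((5, 5), icc)
np.fill_diagonal(R, 1.0)
print(np.round(R, 2))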
Unstructured: the multivariate model
The next model to be fitted is the least parsimonious model possible for these data, an
'unstructured' model. The distinguishing feature is that a separate variance is allowed for
each occasion, and a separate covariance (and therefore correlation) between each and
every pair of occasions.
Equations window
Click Add term
Choose the Variable to be Occasion and select None as the reference category
Click on each of the four created dummies in turn and tick off the fixed parameter,
and tick on the j(Child) differential random term
Click on the Constant and tick off the differential for j(Child ) to avoid
multicollinearity; Done
Start to convergence
Store on bottom tool bar, naming the model UN
After convergence the estimates are as follows.
There are now estimates for each of the five occasions at level 2 and there is potentially a
different covariance between each pair of occasions. The variance at level 1 is zero. This is to
be expected as we are estimating a differential for each child at each occasion at level 2,
there is no unexplained variation left at level 1 in this saturated model.56 We can use the
Estimates Tables at level 2 (accessed through Model on Main menu) to examine the
conditional correlations of the children’s’ Endurance at different occasions over time.
and update our table
56
We are in fact fitting a multivariate multilevel model where the response is the stacked endurance and level
1 exists solely to define the multivariate structure (Goldstein, H, 2011, Multilevel statistical models, Chapter 6:
th
Multivariate multilevel data, 4 edition, Arnold, London).
- 165 -
Pair of Occasions   CS     UN     Lag
1 and 2             0.61   0.55   1
2 and 3             0.61   0.62   1
3 and 4             0.61   0.63   1
4 and 5             0.61   0.67   1
2 and 4             0.61   0.61   2
3 and 5             0.61   0.65   2
1 and 4             0.61   0.58   3
2 and 5             0.61   0.62   3
1 and 5             0.61   0.60   4
and then plot both estimates against the lag.
There is roughly the same degree of correlation, about 0.6, between any pair of occasions,
and the only barely discernible pattern is that the correlations between occasions after the
first are a little larger. The unstructured formulation has the advantage of uncovering the
nature and degree of dependence of the Endurance measurements over time, conditional on
the fixed-part variables. Indeed it would be possible to discern quite complex regime
changes, such as before and after puberty, but there are also real dangers of over-fitting, in
that the estimated covariance matches the observed pattern exactly but may not be the true
pattern. These are the estimates for the compound symmetry and unstructured models.
                     CS        S.E.    UN        S.E.
Fixed Part
Constant             18.115    0.359   18.084    0.357
(Age-gm)^1           0.856     0.028   0.857     0.028
(Age-gm)^2           0.010     0.011   0.009     0.011
Girl                 -2.473    0.510   -2.441    0.507
(Age-gm)^1.Girl      -0.443    0.040   -0.437    0.040
(Age-gm)^2.Girl      -0.059    0.016   -0.056    0.015
Rural                3.056     0.465   3.144     0.456
Girl.Rural           -2.044    0.657   -2.118    0.643
Between Child
Constant/Constant    6.987     0.651
Occ1r/Occ1r                            10.141    0.828
Occ2r/Occ1r                            6.247     0.756
Occ2r/Occ2r                            12.724    1.060
Occ3r/Occ1r                            5.898     0.704
Occ3r/Occ2r                            7.209     0.814
Occ3r/Occ3r                            10.763    0.908
Occ4r/Occ1r                            6.427     0.754
Occ4r/Occ2r                            7.588     0.865
Occ4r/Occ3r                            7.186     0.806
Occ4r/Occ4r                            12.301    1.033
Occ5r/Occ1r                            6.326     0.718
Occ5r/Occ2r                            7.377     0.821
Occ5r/Occ3r                            7.053     0.768
Occ5r/Occ4r                            7.790     0.827
Occ5r/Occ5r                            11.104    0.920
Between Occasion     4.519     0.191   0.000     0.000
-2*loglikelihood:    6802.133          6785.074
Question 13: what are the differences between the fixed estimates and their
standard errors in the two models? Is the more complicated unstructured model a
significantly better fit to the data?
Toeplitz as a constrained model
Substantively, at this point we would have completed our analysis, as the 'sensitivity'
analysis of the unstructured model has found that there are no distinctive patterns of
dependency that require modelling beyond the compound symmetry dependency (see the
answers to Q13). However, in the interest of pedagogy we are going to continue modelling,
as in other data sets you often find a distinctive form of dependency, such as that portrayed
in the figure below, in which the dependency reduces the greater the lag, so that occasions
that are further apart are less correlated. Consequently we fit a further set of models.
The next model to be fitted is the Toeplitz structure, which is more parsimonious than
the unstructured covariance (with its 15 separate random parameters) but less
parsimonious than the compound symmetry model (with its 2 parameters). The distinctive
feature of the Toeplitz is that the same correlation is imposed when pairs of occasions are
separated by the same lag.57 This gives a banded form of structure where pairs 1 and 2, 2
and 3, 3 and 4, and 4 and 5 are constrained to have the same correlation as they are 1 lag
apart, while pairs 2 and 4 and 3 and 5 are constrained to have the same, but potentially
different, correlation reflecting that they are 2 lags apart. It is clear, therefore, that the
Toeplitz structure is a constrained version of the unstructured model, and this permits one
form of estimation that can be used in MLwiN.
The number and nature of the constraints can be appreciated from the following table,
where the letters (A, B, etc.) signify the estimates that must be the same for the Toeplitz
assumptions to apply.

        Occ1   Occ2   Occ3   Occ4   Occ5
Occ1    A
Occ2    B      A1
Occ3    C      B1     A2
Occ4    D      C1     B2     A3
Occ5    E      D1     C2     B3     A4
Consequently, the constraints for a homogeneous Toeplitz model are:58

Variances:  4 constraints, as the later variances have to be constrained to the variance of
            occasion 1 (A);
Lag 1:      3 constraints, as the 3 other covariances one lag apart are constrained to the
            covariance of Occasions 1 and 2 (B);
Lag 2:      2 constraints, as the 2 other covariances two lags apart are constrained to the
            covariance of Occasions 1 and 3 (C);
Lag 3:      1 constraint, as the 1 other covariance three lags apart is constrained to the
            covariance of Occasions 1 and 4 (D).

This gives a total of 10 constraints.

57 Named after Otto Toeplitz (1881-1940); a Toeplitz matrix has the structure such that each descending
diagonal from left to right is constant.
58 In MLwiN it is not possible to fit a 'heterogeneous' Toeplitz by un-constraining the variances when RCON
has been used. While this would result in banded covariances it would not result in banded correlations, as
the latter are calculated in relation to the respective variances. MLwiN does not have specific parameters for
correlations, only for covariances.
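The banded pattern is exactly that of a Toeplitz matrix built from its first column; a minimal sketch in Python, labelling each cell with its lag (cells with the same label share an estimate):

# Each descending diagonal is constant: 0 marks the variance, 1-4 the lag
import numpy as np
from scipy.linalg import toeplitz
col = np.arange(5)
print(toeplitz(col))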
Model on Main menu
Constrain parameters
Choose random part
Number of constraints to be 10
Put a 1 for the parameter to be involved in the constraint, a -1 for it to be involved as a difference, and a 0 as the value for the difference to be equal to; the first four constraints involve the variances; the next six constrain the covariances such that the differences are zero59
Store constraint matrix for random parameters in c200
Attach random constraints
More to convergence; Store Model as Toep

59 The process is akin to the intervals and tests procedure: instead of testing against a difference of zero, you
are constraining a difference to be zero.
The results show the same variance and the banded form of the covariances, as do the
correlations, which we will include in our table with the other estimates.
Pair of Occasions   CS     UN     Toeplitz   Lag
1 and 2             0.61   0.55   0.609      1
2 and 3             0.61   0.62   0.609      1
3 and 4             0.61   0.63   0.609      1
4 and 5             0.61   0.67   0.609      1
2 and 4             0.61   0.61   0.604      2
3 and 5             0.61   0.65   0.604      2
1 and 4             0.61   0.58   0.597      3
2 and 5             0.61   0.62   0.597      3
1 and 5             0.61   0.60   0.605      4
Question 14: is the Toeplitz model a significantly better fit than the compound
symmetry model? What are the implications of this result?
___________________________________________________________________________
Toeplitz with SETD
It is useful at this point to estimate the Toeplitz model in another way: starting with the
compound symmetry model of Model 8 and including the additional covariances by
imposing a structured design matrix, based on the lags, on the random part of the model in
addition to the existing variances.60 We first have to create the lag structure

lag(i1, i2)_j = | t_i1j - t_i2j |        (33)

where t_ij is the occasion of the i'th measurement for the j'th person. These form a
symmetric matrix of dimension 5 (the 5 occasions) for each and every Child j, which has
zeroes on the main diagonal; the off-diagonal terms are the time differences, which here are
simply lags. After opening the Model 8 worksheet, this matrix is created via the Command
Interface using the SUBS command, which produces, in long form, a stacked half-symmetric
matrix. The relevant form of the command here is

Subs 'Child' -1 'Occasion' 'Occasion' c210

where 'Child' defines the level, -1 results in the elements on the major diagonal of each
Child matrix being set to zero, the values in the first 'Occasion' are subtracted from the
second 'Occasion' to give the lag differences, and the resultant stacked matrix is stored in
c210 for use as a design vector for the random parameters. To view the stacked lower-triangular
matrix for each Child, give the command

Mview 'Child' c210

and in the Output window you will get a long listing of a matrix for each and every child. It
is worth looking at these; the extract below gives the last three.

60 It may be easier to appreciate the nature of the SETD function if we look at a multilevel model in its mixed-model
formulation: Y = Xβ + ZU, where Y is the vector of responses, β is the vector of unknown fixed-part terms,
X is the matrix of predictor variables, U is a vector of unknown random effects and Z is the specified design
matrix for the random effects. In a single-level model this design matrix is an identity matrix, with 1 on the
main diagonal (to obtain the parameter σ²e0) and zeros elsewhere, which means that there is no covariance
between units. It is this design matrix that is specified in a very flexible way when the SETD command is used.
Here, to achieve a Toeplitz structure, four sets of block diagonal structures, reflecting lags of 1, 2, 3 and 4, are
imposed.
BLOCK ID : 298      Child 298 who was only measured on 3 occasions
     1   2   3
1    0
2    2   0
3    3   1   0

BLOCK ID : 299      Child 299 who was measured on all 5 occasions
     1   2   3   4   5
1    0
2    1   0
3    2   1   0
4    3   2   1   0
5    4   3   2   1   0

BLOCK ID : 300      Child 300 who was only measured on 4 occasions
     1   2   3   4
1    0
2    1   0
3    2   1   0
4    4   3   2   0

It is clear that this block diagonal matrix gives the lag indicator of each observed occasion
for each child, so that child 300 was not measured on occasion 4 (there is no lag that is 3 away
from occasion 1).
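The lag structure that SUBS builds can be reproduced with absolute differences between each pair of observed occasions; a minimal sketch in Python (the occasion pattern for child 300 is an assumption consistent with the lags shown above):

# Pairwise lag matrix for one child; child 300 is missing occasion 4
import numpy as np
occ = np.array([1, 2, 3, 5])
lags = np.abs(np.subtract.outer(occ, occ))
print(np.tril(lags))      # lower triangle with zeros on the diagonal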
Staying in the Command Interface, we can create four sets of block diagonal matrices, one
for each lag, with a 1 signifying that the observation is involved in the variance-covariance
term and 0 otherwise; view them; and then apply them.

Name c211 'LagOne' c212 'LagTwo' c213 'LagThree' c214 'LagFour'
Change 2 c210 0 'LagOne'        create lag-1 indicator
Change 3 'LagOne' 0 'LagOne'
Change 4 'LagOne' 0 'LagOne'
Mview 'Child' 'LagOne'          check it

Here is the matrix for Child 299, who has a complete record; we see that only occasions 1
lag apart have a 1.
BLOCK ID : 299
     1   2   3   4   5
1    0
2    1   0
3    0   1   0
4    0   0   1   0
5    0   0   0   1   0
Change 1 c210 0 'LagTwo'        create lag-2 indicator
Change 2 'LagTwo' 1 'LagTwo'
Change 3 'LagTwo' 0 'LagTwo'
Change 4 'LagTwo' 0 'LagTwo'
Mview 'Child' 'LagTwo'          check it

Here is the matrix for Child 299; we see that only occasions 2 lags apart have a 1.
BLOCK ID : 299
     1   2   3   4   5
1    0
2    0   0
3    1   0   0
4    0   1   0   0
5    0   0   1   0   0
Change 1 c210 0 'LagThree'      create lag-3 indicator
Change 2 'LagThree' 0 'LagThree'
Change 3 'LagThree' 1 'LagThree'
Change 4 'LagThree' 0 'LagThree'
Mview 'Child' 'LagThree'        check it

Change 1 c210 0 'LagFour'       create lag-4 indicator
Change 2 'LagFour' 0 'LagFour'
Change 3 'LagFour' 0 'LagFour'
Change 4 'LagFour' 1 'LagFour'
Mview 'Child' 'LagFour'         check it
Here is the matrix for Child 299 for lag 4; we see that only occasions 1 and 5, which are 4
lags apart, have a 1.

BLOCK ID : 299
     1   2   3   4   5
1    0
2    0   0
3    0   0   0
4    0   0   0   0
5    1   0   0   0   0
We now have to impose the created design matrices at the child level

Setd 2 'LagOne'
Setd 2 'LagTwo'
Setd 2 'LagThree'
Setd 2 'LagFour'        impose the design matrices61

Start to convergence
Store the estimated model as ToepSD

61 If there are any missing values for the response variable, you will get an error that the design matrix is not
of the correct size. This is why we used the Listwise deletion command earlier. Notice that the dependency is
required between occasions, that is at level 1, but the design matrices are imposed at level 2, for this is how
we impose similarity within children in MLwiN. To remove a design structure, use a command such as
CLRDesign 2 c220.

To get the estimates of the new terms (they are not shown in the Stored models nor in the
Equations graphical interface), we have to give the command Rand in the Command
Interface, which will display the following in the Output window
LEV.  PARAMETER            (NCONV)  ESTIMATE  S. ERROR(U)  PREV. ESTIM
----------------------------------------------------------------------
2     Constant /Constant   ( 2)     6.937     0.6566       6.941
2     LagOne   *           ( 0)     0.1007    0.191        0.0996
2     LagTwo   *           ( 1)     0         0            0
2     LagThree *           ( 3)     0         0            0
2     LagFour  *           ( 0)     0.05778   0.3299       0.05358
----------------------------------------------------------------------
1     Constant /Constant   ( 1)     4.566     0.2137       4.563
The term associated with the Constant at level 2 gives the variance between children, while
the term associated with the Constant at level 1 gives the variance within children, between
occasions. The four new terms associated with the Lag variables give the additional
covariance for each lag. Consequently, the degree of correlation can be calculated as

Corr(Lag_m) = (σ²u0 + α_Lag_m) / (σ²u0 + σ²e0)        (34)
To derive the correlations at lag 1, lag 2, lag 3 and lag 4, use the Command Interface

calc b1 = (6.937 + 0.1007) / (6.937 + 4.566)      Lag One
0.61181
calc b1 = (6.937 + 0.0) / (6.937 + 4.566)         Lag Two & Three
0.60306
calc b1 = (6.937 + 0.05778) / (6.937 + 4.566)     Lag Four
0.60808

and update the table of results for dependency with these values.
Pair of Occasions   CS     UN     Toeplitz_Constrained   Toeplitz SETD   Lag
1 and 2             0.61   0.55   0.61                   0.61            1
2 and 3             0.61   0.62   0.61                   0.61            1
3 and 4             0.61   0.63   0.61                   0.61            1
4 and 5             0.61   0.67   0.61                   0.61            1
2 and 4             0.61   0.61   0.60                   0.60            2
3 and 5             0.61   0.65   0.60                   0.60            2
1 and 4             0.61   0.58   0.60                   0.60            3
2 and 5             0.61   0.62   0.60                   0.60            3
1 and 5             0.61   0.60   0.61                   0.61            4
On completing the summary table, it is clear that exactly the same results have been found
for the Toeplitz model by imposing constraints on the unstructured model, thereby reducing
the number of random parameters from 15 to 5, as by adding 3 additional terms to the
compound symmetry model, increasing the number of parameters from 2 to 5.62 If you
compare the deviance of the two versions of the Toeplitz model you will also see that these
models are indistinguishable, both having the value 6802. Consequently, neither version is a
significant improvement over the simpler compound symmetry form.
Autoregressive weights model using lags
The next model, the autoregressive weights model, has one more parameter than the
compound symmetry model. The dependency that we want to impose in this model is such
that the covariance decreases as the time 'distance', here the lag, between measurements
increases. We are again going to do this by imposing a structured design matrix on the
random part of the model in addition to the existing compound symmetric structure of
Model 8. In a 2-level model of repeated measures within individuals, such a time-dependent
structure can be defined as follows:

cov(e_i1j, e_i2j) = α * 1/|t_i1j - t_i2j|        (35)

that is, the covariance for person j at two occasions i1 and i2 depends on a 'distance' decay
or inverse weight function, where α, the autoregressive parameter, is to be estimated and
t_ij is the time of the i'th measurement for the j'th child.63

Returning to the specification of Model 8, we first have to create the time difference
structure64

lag(i1, i2)_j = | t_i1j - t_i2j |        (36)

These form a symmetric matrix of dimension 5 (the 5 occasions) for each and every child j,
which has zeroes on the main diagonal; the off-diagonal terms are the time differences,
which here are simply lags. As before, this is created in the Command Interface using the
SUBS command, which produces, in long form, a stacked half-symmetric matrix. The
relevant form of the command here is

Subs 'Child' -1 'Occasion' 'Occasion' c210

where 'Child' defines the level of the α parameter, -1 results in the elements on the major
diagonal of each child matrix being set to zero, and the values in the first 'Occasion' are
subtracted from the second 'Occasion' to give the time differences; the resultant stacked
matrix is stored in c210.

62 Not all four terms can be estimated at level 2 as there is a linear dependency in the terms, and one of the
terms must be estimated to be zero; it was included here for pedagogical purposes. The total number of
estimable random terms in the Toeplitz model equals the number of occasions, in this case 5. Consequently
one of the estimates is a structural zero and another is an estimated zero.
63 This distinction between occasion and time, although not needed here, permits a very general specification
where time could be continuous, so that for some children the second occasion is 1.1 years later while for
others it is 1.9 years later.
64 We could also start with ToepSD and then clear the design matrix completely with the commands CLRD 2
'LagOne'; CLRD 2 'LagTwo'; CLRD 2 'LagThree'; CLRD 2 'LagFour'.
Staying in the Command Interface, we now need to calculate the inverse weights
1/|t_i1j - t_i2j|. We have to change the zeros to -1 to avoid a zero divide before we calculate
the weights, and then set the diagonal back to zero once the weights have been calculated.

Change 0 c210 -1 c210
calc c220 = 1/c210
change -1 c220 0 c220
Mview 'Child' c220

The Output window will show the inverse weights, which are now stored in c220; here are
the values for child 299 who was measured on all 5 occasions.
BLOCK ID : 299
     1        2        3     4   5
1    0
2    1        0
3    0.5      1        0
4    0.33333  0.5      1     0
5    0.25     0.33333  0.5   1   0
Clearly, occasions that are further apart have a smaller weight. The single set of weights is
now imposed on the model as a design matrix which structures the random part of the
model. Again in the Command Interface
SetDesign 2 c220
Start to convergence
Store the estimated model as Auto
You will notice that the estimate for the α parameter is not included in the comparison table,
nor is it shown in the Equations window. To see the estimate we have to issue the following
in the Command Interface.

Rand
which will display the random parameter estimates in the Output window
LEV.  PARAMETER            (NCONV)  ESTIMATE  S. ERROR(U)  PREV. ESTIM
----------------------------------------------------------------------
2     Constant /Constant   ( 1)     6.871     0.6788       6.851
2     c220     *           ( 0)     0.1703    0.3173       0.1616
----------------------------------------------------------------------
1     Constant /Constant   ( 1)     4.629     0.2836       4.627
The estimate for α is 0.1703. To appreciate the meaning of this value we can generate a
small data set of the lags and then calculate the correlation structure. In the Command
Interface

gene 1 4 c100                      generate the lags 1 to 4
calc c101 = (1/c100)               calculate the weights
calc c102 = (0.1703 * c101)        covariance as a function of α and the inverse weights
calc c103 = c102 + 6.871           add in the compound symmetry element
calc c104 = c103/(6.871 + 4.629)   calculate the correlation elements
print c101-c104                    print out the results
N = 4
     c101      c102       c103     c104
1    1.0000    0.17030    7.0413   0.61229
2    0.50000   0.085150   6.9562   0.60488
3    0.33333   0.056767   6.9278   0.60241
4    0.25000   0.042575   6.9136   0.60118
We can see that the autoregressive weights model based on lags has imposed a slightly
declining dependency with lag. Updating our table, we get the following results.
Pair of Occasions   CS     UN     Toeplitz_Constrained   Toeplitz SETD   Auto Lag   Lag
1 and 2             0.61   0.55   0.61                   0.61            0.61       1
2 and 3             0.61   0.62   0.61                   0.61            0.61       1
3 and 4             0.61   0.63   0.61                   0.61            0.61       1
4 and 5             0.61   0.67   0.61                   0.61            0.61       1
2 and 4             0.61   0.61   0.60                   0.60            0.60       2
3 and 5             0.61   0.65   0.60                   0.60            0.60       2
1 and 4             0.61   0.58   0.60                   0.60            0.60       3
2 and 5             0.61   0.62   0.60                   0.60            0.60       3
1 and 5             0.61   0.60   0.61                   0.61            0.60       4
If we compare the deviance for the compound symmetry and the autoregressive weights
models there is a very small difference, which with 1 degree of freedom (due to the
additional α parameter) is associated with a highly insignificant p value. Again we have no
reason to reject the simpler CS model.

This autoregressive weights model can be modified in a number of ways.65 Thus we
could use the inverse of the squared lags to get a steeper decline in the dependency (the
unstructured model might have suggested this) or we could use, instead of the lags, the
distances apart in continuous time. We will illustrate the latter. Starting again with Model 8,
we first get the time distances apart between measurement occasions.
Subs ‘Child’ -1 ‘Age’ ‘Age’ c210
Mview ‘Child’ c210
Here are the time intervals for child 299 who was measured on all 5 occasions.
BLOCK ID : 299
     1     2     3     4     5
1    0
2    1.5   0
3    3.4   1.9   0
4    5.4   3.9   2     0
5    7.6   6.1   4.2   2.2   0
So that occasion 1 and occasion 5 were 7.6 years apart. We again have to change the zeros
to -1 to avoid a zero divide before we calculate the weights, and then set the diagonal back
to zero once the weights have been calculated.
Change 0 c210 -1 c210
calc c220=1/c210
change -1 c220 0 c220
Mview ‘Child’ c220
The weights for child 299 are then

BLOCK ID : 299
     1        2        3       4        5
1    0
2    0.66667  0
3    0.29412  0.52632  0
4    0.18519  0.25641  0.5     0
5    0.13158  0.16393  0.2381  0.45455  0

65 In many ways the autoregressive weights procedure is a more flexible procedure than AR(1) models, where
the lag-2 correlation is the lag-1 correlation squared, and the lag-3 correlation is the lag-1 correlation cubed.
There are MLwiN macros available for fitting AR(1) models as part of a very general specification but they have
not been updated to work with version 2.1, as they have been found to be rather unstable and sensitive to the
declared and required starting value for the autoregressive parameter. The macros can still be found at
http://www.cmm.bristol.ac.uk/MLwiN/download/D-1-10/index.shtml; they need updating, not least because
the random parameter estimates are no longer stored in c96 but in c1096.
These are imposed on the model as a design matrix, and the following commands given

SetDesign 2 c220                        in the Command window
Start to convergence                    in the Main window
Store the estimated model as AutoCont   in the Equations window
Rand                                    in the Command window

To obtain the results
LEV.  PARAMETER            (NCONV)  ESTIMATE  S. ERROR(U)  PREV. ESTIM
----------------------------------------------------------------------
2     Constant /Constant   ( 1)     6.948     0.6761       6.952
2     c220     *           ( 0)     0.1186    0.5944       0.01208
----------------------------------------------------------------------
1     Constant /Constant   ( 1)     4.558     0.2718       4.527
Again we need to generate a short set of weights to see how the dependency changes with
the distance between observations; here we will use 1 to 9 years apart

gene 1 9 c100                      generate the time intervals
calc c101 = (1/c100)               calculate the weights
calc c102 = (0.1186 * c101)        covariance as a function of α and the inverse weights
calc c103 = c102 + 6.871           add in the compound symmetry element
calc c104 = c103/(6.948 + 4.558)   calculate the correlation elements
print c100-c104                    print out the results
N = 9
     c100     c101      c102       c103     c104
1    1.0000   1.0000    0.11860    6.9896   0.60747
2    2.0000   0.50000   0.059300   6.9303   0.60232
3    3.0000   0.33333   0.039533   6.9105   0.60060
4    4.0000   0.25000   0.029650   6.9007   0.59974
5    5.0000   0.20000   0.023720   6.8947   0.59923
6    6.0000   0.16667   0.019767   6.8908   0.59888
7    7.0000   0.14286   0.016943   6.8879   0.59864
8    8.0000   0.12500   0.014825   6.8858   0.59846
9    9.0000   0.11111   0.013178   6.8842   0.59831
So the degree of dependency hardly changes at all, and it is no surprise that the
continuous-time autoregressive weights model, with its extra parameter, is not significantly
different from the compound symmetry model.
MCMC estimation of complex dependency
The models in this section have so far all been estimated by IGLS/RIGLS procedures because
the MCMC implementation does not currently recognise any user-defined constraints that
have been set or any weights matrix imposed by the SETD command. Of course the
compound symmetry and the multivariate unstructured models can be estimated by both
maximum likelihood and MCMC procedures. Moreover it is also possible to fit an explicit
multivariate model that does not have any term for level-1 variation and then impose one of
a number of alternative 'correlational' structures at level 2. These are:

'Full covariance matrix' - the unstructured multivariate model;
'All correlations equal/all variances equal' - the homogeneous compound symmetry model;
'All correlations equal/independent variances' - the heterogeneous compound symmetry model;
'AR1 structure, all variances equal' - the homogeneous 1st order autoregressive model;
'AR1 structure/independent variances' - the heterogeneous 1st order autoregressive model.
However, there is no Toeplitz model available in MCMC, and we have to 'trick' MLwiN into
modelling time-varying variables, as the way the multivariate model is constructed from the
'wide' data means that there is no way of indicating that particular predictor variables
belong to a set of variables, that is Age1 to Age5.
We first fit an unstructured formulation as a multivariate model and then impose
restrictions on the nature of the dependency between occasions. Begin by retrieving the
original Wide data:
File on main menu
Open worksheet
endurancewide.wsz
Equations window
Clear on bottom toolbar to remove any existing model
Estimation control to RIGLS
click on Responses in the bottom toolbar,
highlight all the variables representing the 5 Endurance measures, and then Done.
(This must be done in the order of End1, End2, … to End5.)
This will have started to create a sensu-stricto multivariate model in the Equations window
with 5 responses; we now need to set up the random effects structure at level 2 for children
as there will be no level 1 variance in this saturated model.
Equations window
Click on any of the responses
N- level 2- ij
Level 2(j) Child
leaving level 1 to be the response indicator
Done
Add term
Variable: choose the Constant
Add separate coefficients to create a dummy for each response
Click on each Response variable in turn, click off the fixed parameter and click
on j(Child_long) to build the level 2 variance covariance matrix.
Add term
Variable: Constant
Add Common coefficient to create an overall constant
Include all
Done
The specification of the model should now look like the following.
There is an overall mean value (β5), 5 variances, one for each occasion, and 10 covariances
between the occasions; the hj formulation allows a common fixed effects model at the child
level. On estimation we get the following unconditional results.
The mean Endurance across all children and all occasions is 15.9 and the variance at
occasion one is 14.1. The variance increases with occasion so that at Occasion 5 the variance
of 41.9 gives substantially bigger differences between children. It is worth looking at this
point at the worksheet via the Names window to see that MLwiN has created a number of
new variables of length 1500 rows. Viewing these variables reveals that the endurance values
have been stacked into a long column Resp, with an indicator variable picking out which
occasion and which child the particular values belong to.
We now need to add in the fixed part terms of the model, and we shall aim to reproduce
Model 8, the most complex fixed effects model supported by the data. This is
straightforward for the time-invariant child-level variables but more complex for the time-varying ones.
Equations window
Add term
Variable: Sex
Reference category: Boy
Add Common coefficient to create an overall Girl dummy for all occasions
Include all
Done
Add term
Variable: Rurality
Reference category: Urban
Add Common coefficient to create an overall Rural dummy for all occasions
Include all
Done
Add term
Order 1
Variable: Sex
Variable: Rurality
Add Common coefficient to create an overall Girl*Rural dummy
Include all
Done
We now start on the time-varying age variables.
Equations window
Add term
Variable: Age1
Centred around grand mean
Add Common coefficient
Include all
Done
Add term
Variable: Age2
Centred around grand mean
Add Common coefficient
Include all
Done
And the Sex-Age interactions
Add term
Order 1
Variable: Age1
Variable: Sex
Add Common coefficient
Include all
Done
Add term
Order 1
Variable: Age2
Variable: Sex
Add Common coefficient
Include all
Done
To create the following specification
Unfortunately Age1-gm and Age2-gm are not the variables that we want, for they
represent time-invariant variables – the replicated values of how old each child was on
occasion 1 and occasion 2. Looking at the Equations window you can see the subscript j, not
the required ij. The trick is to create the variables of the form we want outside of the
Equations window and then replace the specified values. We first need to create a long
vector of the Age variables and store it in a free column; here c37.
Data Manipulation in main Menu
Join
If you inspect this variable you will not find it in the order we need, of age nested within
children, but the Ages of occasion 1 stacked on occasion 2, etc. So we have to create a stacked
child variable in c38, generate an occasion index as a repeated sequence,66 and then sort
on these variables carrying the long age variable.

join 'child' 'child' 'child' 'child' 'child' c38

66 We need both the Child and occasion index as MLwiN’s sort command does not maintain the original order
when sorting; this generate vector facility of course can only work when the data are completely balanced,
that is before missing values have been removed.
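The same restacking can be sketched outside MLwiN; here is a minimal pandas illustration (the column names and values are hypothetical stand-ins for the worksheet variables).

import pandas as pd

# one row per child in the wide data: child, Age1 ... Age5
wide = pd.DataFrame({
    "child": [1, 2],
    "Age1": [7.9, 8.1], "Age2": [9.0, 9.2], "Age3": [10.1, 10.0],
    "Age4": [11.0, 11.1], "Age5": [12.0, 12.2],
})
long = wide.melt(id_vars="child", var_name="occasion", value_name="age")
long["occasion"] = long["occasion"].str.replace("Age", "").astype(int)
# sort so that occasions are nested within children, as the model requires
long = long.sort_values(["child", "occasion"]).reset_index(drop=True)
print(long)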
Finally we centre the long age variable around its overall mean of 11.493 and replace the
values in the equation variables with the correct ones. In the Command window:

CALC c37 = c37 - 11.493
CALC '(Age1-gm).12345' = c37
CALC '(Age2-gm).12345' = c37 ** 2
CALC '(Age1-gm).Girl.12345' = c37 * 'Girl.12345'
CALC '(Age2-gm).Girl.12345' = c37 * c37 * 'Girl.12345'
You should notice that the subscript changes to ij. After estimation you should get the
following results, which are the same as those for the unstructured model that we obtained earlier.
We are now going to use MCMC estimation for a range of models.
Estimation Control on main menu
MCMC estimation
Burn in 500
Monitoring chain length 50000
Done
Equation on main menu
MCMC
MCMC methods
Choose Univariate Metropolis Hastings estimation for Fixed and
Random effects67
Done
Equation on main menu
MCMC
Correlated residuals
Choose Full covariance matrix - the unstructured multivariate model
Done
Start68
storing the estimates as Unstruct. Here is the variance and correlation matrix (accessed
through Model on main menu, Estimate tables).
67 The default Gibbs sampling will not work correctly when estimating a common fixed coefficient.
68 It is not possible to use the MCMC option for hierarchical centering to speed the estimation of this model.
The variances and correlations are all separately estimated, and it can be seen that there is a
conditional correlation between occasions of about 0.6 with a variance of around 12 on
each occasion; the increasing between-child variance found in the unconditional
multivariate model has disappeared with the explicit modelling of the different growth of
boys and girls.
We can now impose some alternative restrictions on this model:
Model on main menu
MCMC
Correlated residuals
This will bring up a window with the set of alternatives we can impose on the full covariance
structure of the multivariate model. Choose each in turn and Start to get the MCMC
estimates, and examine the variance and correlation estimates. It is also useful to give the
command BDIC in the Command window after each model has been estimated so as to get
a ‘complexity penalized badness of fit’ measure. Here are the correlations for ‘All
correlations equal/all variances equal’, that is the homogenous compound symmetry model.
It is clear that all the variances have been constrained to be the same (11.6), as have all
covariances and hence correlations (0.61).
The results for the heterogeneous compound symmetry model, that is ‘All correlations
equal/independent variances’, are:
Notice that the variances vary as do the covariances, but the correlations are the same
between each pair of occasions.
The results for the homogenous 1st order autoregressive model, that is ‘AR1 structure,
all variances equal’ (this is not a true autoregressive model with lagged responses on the
right-hand side of the equation), are as follows.
Notice that the lag 2 correlation is the lag 1 correlation squared, and the lag 3 correlation is
the lag 1 correlation cubed:
The results for the heterogeneous 1st order autoregressive residual structure, that is
‘AR1 structure/independent variances’ (again not a true autoregressive model), are:
We again see the imposed rigid declining correlation with lag.
We can now compare, after checking that the monitoring chain has been run long
enough, the Deviance Information Criterion for each model and the estimated degrees of
freedom consumed in the fit (pD), against the homogenous compound symmetry model.
Model for random part                     MLwiN Description                       pD       DIC       Difference in DIC
                                                                                                      from Homogenous CS
Unstructured                              Full covariance matrix                  100.02   7265.44   4.90
Homogenous compound symmetry              All correlations equal/                  88.94   7260.54   0.00
                                          all variances equal
Heterogeneous compound symmetry           All correlations equal/                  92.59   7262.42   1.88
                                          independent variances
Homogenous 1st order autoregressive       AR1 structure/all variances equal        89.33   7431.29   170.75
Heterogeneous 1st order autoregressive    AR1 structure/independent variances      93.39   7430.01   169.47
Question 15: have the models been run for long enough? Do the estimates for pD
make sense? Which is the preferred model if the DIC criterion is used?
Question 16: make a plot of the predictions from the fixed part estimates of the
results for boys and girls in urban and rural areas at different stages of
development. What do the results show?
These results complete the analysis of the Madeira Growth Study; there is no
evidence that a more complex model than the compound symmetry approach is needed. We
now proceed to more technical matters. First we consider the difference between
population average and subject-specific models, and then the vexed question of
whether fixed or random coefficient formulations should be used for subject-specific
models. The Madeira data will be used to illuminate both debates.
Discrete outcomes: population average versus subject specific
Earlier in this chapter we considered briefly the difference between two types of modelling.
One is the conditional, random-effects, subject-specific approach, which has been the
main focus of this chapter; the other is the marginal or population-average approach
epitomised by the GEE form of estimation.69 In the Normal-theory model for a continuous
response, there may be differences in the estimation procedures for the fixed part of the
model, but the resultant conditional and marginal estimates and their interpretation are very
unlikely to be substantively different. This is usually not the case, however, when the
response is discrete and a non-linear transformation has been made of the response
variable (see Chapter 12, this volume). The values for the population average are likely to be
smaller in absolute value, and the difference between the two estimates will be largest when
there is substantial between-subject heterogeneity or, equivalently, a high degree of
dependency, which is commonly the case when analysing longitudinal data.
Figure 13 Subject-specific and population average estimates on the logit and the
probability scale.
69 Marginal is used to emphasize that the mean response is modelled conditional only on covariates and not on
other responses or random effects.
Figure 13 aims to highlight and explain these differences. On the left-hand side of
the graph is the log-odds of a ‘Yes’ outcome for a binary response; on the right-hand side is a
plot of the probability of a Yes outcome. On the logit graph there are six lines. The five
parallel lines represent the results from a conditional model where the middle line is the
fixed-part estimate representing the subject-specific mean, and each of the subject lines in
this random intercepts model departs from this line by an amount u0j. The greater the level 2
between-child variance, the further apart the lines will be. Also shown on this logit graph is
the marginal mean estimate of the relationship. The marginal approach is concerned with
treating the dependency between occasions as a nuisance and there is no explicit term in
the model for subject-level differences; there are no u0j’s. Consequently, we cannot plot the
subject-specific lines to look at modelled growth trajectories in the marginal model. They do
not exist, and we only have the marginal model mean estimate – the population average.
This is akin to a single-level model estimate with the standard errors appropriately inflated
due to between-occasion dependence.70 Clearly the marginal mean relationship has a much
shallower slope than the subject-specific mean. The departure depends on the between-child
variance. If there is no higher level variance we are back to a single-level model and
the marginal and conditional estimates will be the same. An approximate formula that
relates the two types of slopes is:
βmarginal ≈ βconditional / √(1 + 0.346 σ²u0)    (37)
Turning now to the probability scale in Figure 13, there are again four subject-specific
relations and these have been obtained by ‘anti-logiting’ the values on the left-hand
side of the diagram. The line that passes through the middle of the four subject curves can
be obtained either by ‘anti-logiting’ the conditional fixed-part estimate or, equivalently, by
taking the median of the subject-specific probabilities at each value of the predictor
variable, here Age. In contrast, the mean probability is obtained either by ‘anti-logiting’ the
marginal mean or, equivalently, by averaging over the subject-specific probabilities. Thus the
probabilities obtained from a conditional model are the medians of the subject-specific
curves, while the marginal models are equivalent to the means of the subject-specific
probabilities. In a two-level longitudinal model, the one gives you the results for the median
person, the other gives you the result for the mean person. In a three-level model of
occasions nested within children within schools, one gives you the probabilities for the median
person in the median school, the other the mean result in the mean school.
The cause of these differences is technical – the mean of a non-linear function (the
logit) does not equal the non-linear function of the mean, whereas the median of the logits
does equal the logit of the median: it is still the middle value. In Normal-theory models
70 When the number of subjects is large and missing data are not an issue, the single-level model estimates will
be the same as the GEE estimates.
you can get the population average in two exactly equivalent ways: either as the curve of
the means or the mean of the curves. In the non-linear model used for discrete data,
however, this works for the median but not the mean.
There is considerable disagreement in the literature over which is the appropriate
model for discrete data, some favouring the population average approach, others the
subject-specific. The debate can be resolved by separating how the model is specified and
estimated on the one hand, and how it is used for predictions or inferences on the other.
Thus, Lee and Nelder (2004) argue that the conditional model specification is more
fundamental as you can estimate a conditional model and make marginal predictions from
it, if these are wanted.71 But they caution that the difference between the models is not
just one of estimation, in that the marginal model is fundamentally limited: it is not readily
extendible to higher levels, nor can it include time-varying heterogeneity (random slopes)
which may be of genuine substantive interest. They approvingly quote nine drawbacks of
the marginal model listed by Lindsey and Lambert (1998), who provide an example where a
treatment can be superior on the average, while being poorer for every individual. Lindsey
and Lambert conclude
‘the "statistical" argument that we should directly model margins if scientific interest
centres on them, is not acceptable on scientific grounds, for it implies that we are
generally imposing more unrealistic physiological mechanisms on our data than by
direct conditional modelling and that these are most likely rendering simple marginal
models greatly biased’72
Lee and Nelder (2004) in their turn, draw clear conclusions
‘the use of marginal models can be dangerous, even when marginal inferences are of
interest. The usefulness of marginal inferences requires the absence of interactions
[ie random slopes] checkable only via conditional models.’
The problem of the conditional versus marginal model can therefore be resolved by
fitting the more flexible conditional model but using it, if marginal inferences are needed, by
‘marginalising the mixed model’. The sole advantage of marginal estimation is that random-effects
estimates and standard errors may be more sensitive to assumptions about the
nature of the residual structure. But of course this advantage only applies if between-child
heterogeneity is not changing with time, a random slopes model is not needed, and
71 Lee, Y-J and Nelder, J A (2004) Conditional and marginal models: another view, Statistical Science, 19, 219-228.
72 Lindsey, J K and Lambert, P (1998) On the appropriateness of marginal models for repeated measurements
in clinical trials, Statistics in Medicine 17, 447-469.
there is no higher level variance of interest.73 We therefore require a method of turning the
estimated, more flexible conditional model into population average predictions.
Analytically this is difficult, as you have to integrate over the continuous random
effects, but it is relatively easy to provide a solution based on simulation. Consequently, a
very general simulation approach is adopted in MLwiN: the conditional model is estimated,
and its estimates can be turned into probabilities by anti-logiting the logits to get the median
probabilities. To obtain the population average values, simulations are made of thousands
of subject-specific curves based on the fixed-part conditional estimates, to which are added
subject-specific departures drawn from a distribution that has the same variance as
that estimated in the conditional model. These are turned into probabilities and their mean
is taken to get the population values. At its simplest this can be seen as a 4-step procedure:74
Step 1: simulate M (5000, say) values for the random effects u0j from a Normal
distribution with the same variance as the estimated level 2 variance;
Step 2: using these 5000 u0j’s, combined with the fixed part estimates and particular
values of the predictor variables, obtain the predicted logits β̂ᵀx* + u(m);
Step 3: anti-logit to get probabilities π*(m) = [1 + exp(−(β̂ᵀx* + u(m)))]⁻¹;
Step 4: take the mean of the probabilities to get the population average predictions
for particular values of the predictor variables.
All of this is done fairly automatically in the customised predictions window of MLwiN. This
procedure is very general in that it can be applied in MLwiN to probit and logit models for
binary outcomes, to ordinal response models and models where the response is a count,
and to multivariate models with more than one response.
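As an illustration of the four steps outside MLwiN, here is a minimal Python sketch; the fixed-part logit of -1.5 and the level 2 variance of 1 are example values (they correspond to the second row of Table 9 below).

import numpy as np

rng = np.random.default_rng(1)
fixed_logit = -1.5                  # beta-hat'x*: example value
sigma2_u = 1.0                      # estimated level 2 variance: example value

# Step 1: simulate M random effects from N(0, sigma2_u)
u = rng.normal(0.0, np.sqrt(sigma2_u), size=5000)
# Step 2: predicted logits for the simulated subjects
logits = fixed_logit + u
# Step 3: anti-logit to get probabilities
probs = 1.0 / (1.0 + np.exp(-logits))
# Step 4: the mean gives the population average; the median recovers
# the subject-specific (conditional) probability
print(f"population average: {probs.mean():.2f}")              # about 0.22
print(f"subject specific (median): {np.median(probs):.2f}")   # about 0.18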
73 For more on the debate in the longitudinal case see Ritz, J and Spiegelman, D (2004) Equivalence
of conditional and marginal regression models for clustered and longitudinal data, Statistical Methods in
Medical Research, 13(4), 309-323, and Hu, F B, Goldberg, J, Hedeker, D, Flay, B R and Pentz, M-A (1998)
Comparison of population-averaged and subject-specific approaches for analyzing repeated binary outcomes,
American Journal of Epidemiology 147, 694-703. The debate has recently spread to non-longitudinal settings,
where Subramanian and O'Malley (2010) make the case that the choice depends on the purpose of the analysis,
to counter Hubbard et al's (2010) claim that population average models provide a more useful approximation
of the truth. Subramanian, S V and O'Malley, A J (2010) The futility in comparing mixed and marginal
approaches to modeling neighborhood effects, Epidemiology 21(4), 475-478; Hubbard, A E et al (2010) To GEE
or not to GEE: comparing population average and mixed models for estimating the associations between
neighbourhood risk factors and health, Epidemiology, 21, 467-474.
74 In practice there is an additional step, in that simulations are also made of the fixed part estimates, based on
the variance-covariances of those estimates, so that we can get confidence intervals for the predictions. More
details are in the MLwiN Manual Supplement.
There are therefore two sets of results. The subject-specific approach models the
evolution of each individual and consequently the conditional slope is the effect of change
for a particular ‘individual’ of unit change in a predictor. In contrast, the population average
slope (which can be derived from the conditional estimate) is the effect of change in the
whole population if everyone’s predictor variable changed by one unit. So the answer to the
question of which is better depends on the use you want to make of the different
estimates. Allison (2009, 36) makes the following contrast:75

‘if you are a doctor and you want an estimate of how much a statin drug will lower
your patient’s odds of getting a heart attack, the subject-specific coefficient is the
clear choice. On the other hand, if you are a state health official and you want to
know how the number of people who die of heart attacks would change if everyone
in the at-risk population took the statin drug, you would probably want to use the
population-averaged coefficients. Even in the public health case, it could be argued
that the subject-specific coefficient is more fundamental’

To put it another way, marginal estimates should not be used to make inferences about
individuals, as that would be committing the ecological fallacy, while conditional estimates should not
be used to make inferences about populations, as that would be the atomistic fallacy. Or, to put it
yet another way, the population average approach underestimates the individual risk and
vice-versa.
Subject-specific and population average inferences in practice.
To appreciate these concepts we first take a simple fabricated example and then apply the
ideas to the Madeira data. Beginning with the simple example, say we have estimated in a
conditional model the log-odds of having a disease in a random-intercepts, variance-components
model to be -1.5 and the level 2 variance to be 0, as in the model below.
75 Allison, P D (2009) Fixed Effects Regression Models, Quantitative Applications in the Social Sciences Series,
Sage, Thousand Oaks, California.
If we then calculate the subject-specific median probability via the customised predictions
procedure, the probability of the disease for a typical individual (the subject-specific
estimate) is 0.18, while the prevalence of disease in the population is also 0.18, as there are no
differences between individuals and the conditional results are the population values.
However, if the level 2 between-individual variance increases to 1, as in Table 9 (equivalent to a VPC
of 23 per cent, based on a standard logistic level 1 variance of 3.29; see Chapter 12, this
volume), the individual risk is the same at 0.18, but this equates to a population prevalence
of 0.22. As the between-person variance grows (the level 1 variance cannot go down in a
binomial logistic model), implying that there is substantial between-person variability in
their propensity for the binary outcome, the difference between the mean and the median
grows. So when the VPC is 86%, signifying extreme between-people heterogeneity, the
estimate of the population prevalence is 38 per cent while the individual risk remains at 18
per cent.76
Table 9 Subject-specific estimates for different levels of between variance77

Logit   Between variance   VPC     Probability
                                   Subject specific (Median)   Pop Average (Mean)
-1.5    0                  0.00    0.18                        0.18
-1.5    1                  0.23    0.18                        0.22
-1.5    3                  0.48    0.18                        0.27
-1.5    5                  0.60    0.18                        0.30
-1.5    8                  0.71    0.18                        0.33
-1.5    20                 0.86    0.18                        0.38
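The VPC column of Table 9 follows directly from the standard logistic level 1 variance of 3.29; a quick check in Python (a minimal sketch):

import numpy as np

var_u = np.array([0, 1, 3, 5, 8, 20])    # level 2 between variances from Table 9
vpc = var_u / (var_u + 3.29)             # VPC with a standard logistic level 1 variance
print(vpc.round(2))                      # [0.   0.23 0.48 0.6  0.71 0.86]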
We can also see how these ideas play out in a simple case with the Madeira data.
We first recode endurance into a 1 for above the mean and a zero below. Then we fit a
single-level logit model with a Bernoulli level 1 distribution that will approximate a
population average model but with incorrect standard errors; this is done for simplicity for
76 A very large between-variance means that a large number of subjects had the same response over occasions,
so that many have 00000 while others have 11111. This is known as a mover-stayer situation. When this is the
case, it is questionable whether the Normality assumptions of the level 2 variance are sensible. It may be more
appropriate to fit a non-parametric model whereby the level 2 distribution is discrete rather than continuous; see
Aitkin, M (1999) A general maximum likelihood analysis of variance components in generalized linear models,
Biometrics 55, 117-128; one such implementation is the npmlreg package in R, others are to be found in Latent
Gold, gllamm in Stata, and Mplus.
77 The table was produced by using MLwiN as a ‘sandpit’. A two-level variance-components model was set up,
the desired value of -1.5 was put into column c1098, where the fixed estimate is stored, and the
level 2 variance values of 0, 1, ... 20 were successively put into c1096, where the variance estimates are stored.
The customised predictions window was then used to get the median and mean values by simulation.
just the linear term of centred age. This is followed by a random-intercepts subject-specific
model. Finally we make customised predictions based on both the medians and the means
and graph the results. Both models are estimated with MCMC 50k monitoring chains so we
can compare the DIC.
To recode the response variable to a binary outcome, use the recode command.
The single-level model is estimated to be
The customised predictions for age 7 to 17 in steps of 1, using both the means and medians,
are as follows; as expected these are the same values in this ‘population-average model’.
The confidence intervals should be treated with caution as the dependency over time is not
being modelled in any way.
Age   Median   Median       Median        Mean    Mean         Mean
               95% CI low   95% CI high           95% CI low   95% CI high
11    0.452    0.425        0.479         0.452   0.425        0.479
12    0.507    0.480        0.535         0.507   0.480        0.535
13    0.562    0.532        0.592         0.562   0.532        0.592
14    0.616    0.581        0.650         0.616   0.581        0.650
15    0.667    0.627        0.704         0.666   0.627        0.704
16    0.714    0.670        0.754         0.713   0.670        0.754
17    0.757    0.710        0.799         0.756   0.710        0.799
Here are the estimates of the random-intercepts, subject-specific logit model, and we can
see a very substantial between-child variance.
Making the assumption that the level 1 distribution is a standard logistic distribution, the
degree of dependency over occasions can be calculated as

ρ = σ²u / (σ²u + 3.29)    (38)

calc b1 = 8.101/(8.101 + 3.29)
0.71118

which is clearly a high level of dependency; equivalently, there are substantial differences
between children of the same age.
Comparing the two sets of estimates:

                            Single level              Random Intercepts
Response: HiLoAerobic       Estimate      S.E.        Estimate      S.E.
Fixed Part
  Constant                  -0.096        0.055       -0.203        0.187
  (Age-gm)                   0.221        0.020        0.455        0.037
Random Part: Level Child
  Constant/Constant                                    8.101        1.371
DIC:                        1838.805                  1233.252
it can be seen that the DIC is considerably lower in the model that takes account of between-child
heterogeneity and that the slope with Age is more than twice as steep in the subject-specific
model. It is also clear that the standard error of this time-varying variable has been
underestimated in the single-level model. The success of the approximate prediction
formula (37) can be seen in that

calc b1 = 0.455/((1 + 0.346*8.101)**0.5)
0.23332

is quite close to 0.221, even in this rather extreme case with a very large between-child
variance of over 8 on the logit scale.
Here are the customised predictions for the subject-specific model:

Age   Median   Median       Median        Mean    Mean         Mean
               95% CI low   95% CI high           95% CI low   95% CI high
11    0.400    0.315        0.491         0.445   0.402        0.489
12    0.513    0.421        0.603         0.500   0.455        0.544
13    0.624    0.530        0.708         0.554   0.508        0.599
14    0.724    0.633        0.797         0.608   0.559        0.654
15    0.805    0.724        0.865         0.659   0.608        0.707
16    0.867    0.798        0.913         0.708   0.655        0.756
17    0.911    0.856        0.946         0.752   0.699        0.801
The greater steepness of the median subject-specific line is clear. Here is a plot of both
the mean and median relationships and the subject-specific modelled probability curves for
all the children.
It is therefore quite straightforward to use MLwiN to fit the mixed model for estimation and
to use the predictions facility to report both the subject-specific and population average results. It is
also a simple matter to report odds, logits or probabilities by choosing the desired metric in
the customised predictions window.
Heagerty and Zeger (2000) have been able to develop multilevel models with
random effects in a marginal framework rather than the usual conditional one.78 The
important advantage of this approach is that the estimates of the regression parameters are
less sensitive to assumptions about the distribution of the random effects. In subsequent
discussion of this paper, Raudenbush argues that although this work is very important, the
choice of which procedure to use depends, as always, on the target of inference, a view he
later elaborated in Raudenbush (2009).79 In particular he argues that if you are interested in the
size and nature of the random effects then the conditional model is the only choice that fits
conceptually, and he recommends a sensitivity analysis to evaluate distributional
assumptions.
Fixed versus Random effects
This section is about the relative merits of a fixed versus random approach to including
subject-specific effects in longitudinal models. This is another area where there has been a
lot of quite trenchant debate about both technical and interpretative matters, in which the
sides have tended to talk past each other. Fortunately (and as in the marginal and mixed
case) it is possible to have your cake and eat it: the mixed random-effects model can be used to
derive what the fixed-effects proponents are seeking, and a lot more besides. We will
consider the form of both models, what both sides write about each other, the ‘prize’ that
the fixed-effects advocates are seeking, and why this is apparently not achievable by a
random-effects model. We will then show that it is achievable with a re-specification of
the multilevel model that has separate terms for the longitudinal and cross-sectional effects
of time-varying variables. This brings both substantive and technical benefits and it puts the
Hausman test, which is commonly employed to choose between fixed and random effects, in
a new light. We will illustrate these ideas using a linear growth model for endurance in
Madeira children. First, however, we have to make two digressions: we consider the
meaning of endogeneity, and then the notion of within- and between-regressions.
This will allow us to appreciate fully why some researchers prefer fixed to random effects
and why they are mistaken to do so.
A digression on endogeneity
Endogeneity, the violation of exogeneity, arises if an explanatory, supposedly
independent, variable is correlated with the residuals. More formally this can be specified as
the zero conditional mean assumption, which in a two-level model with a level 1 predictor applies to
both sets of residuals:

E(u0j | x1ij) = 0    (39)

and

E(e0ij | x1ij) = 0    (40)

78 Heagerty, P J and Zeger, S L (2000) Marginalized multilevel models and likelihood inference, Statistical
Science 15(1), 1-26.
79 Raudenbush, S W (2009) Targets of inference in hierarchical models for longitudinal data, in Fitzmaurice et
al (eds.) Longitudinal Data Analysis, CRC Press, Boca Raton.
There are in fact three underlying causes of endogeneity (which can occur in combination):
 Simultaneity: this is when we can conceive of a reciprocal causal flow, so that y
determines x and x determines y; or, to put it another way, the dependent and
explanatory variables are jointly determined. We are not going to discuss this any
further, but note that one form of this is the dynamic type of longitudinal model
where the outcome occurs on both the left- and right-hand sides of the equation as
lagged responses. This can be handled by a multivariate multilevel model in which
more than one response can be modelled simultaneously; see Steele (2011)80
 Omitted variables bias: this occurs when we have not measured or do not know of
important predictor variables that affect both the outcome and the included
predictor variables. This is a very pervasive problem as it is difficult, for example, to
measure ability, aptitude, proficiency, parental-rearing practices and genetic make-up
in developmental studies. In labour economics, relating the outcome of wages to
years of schooling may be problematic as both variables may be correlated with
unobserved ability. Similarly, in comparative political economy the institutional
features of a country may be important, and there may be aspects of a country’s
culture, history and geography that are not readily measurable. If a variable is
correlated with the outcome and with the predictors included in the model, then its
omission will impart bias to the slope estimates of the included variables. We will be
particularly concerned with level 2 or cluster-level endogeneity, which arises in the
longitudinal case from correlations between time-varying predictors and omitted
child characteristics, so that the level 2 random effects are correlated with level 1
covariates (equation 39). Random-effects models stand accused of being very
susceptible to such bias and that is why fixed effects are often preferred. It will be
shown, however, that it is straightforward to deal with such cluster endogeneity
within the random-effects framework, which offers a much more flexible approach.
For other types of omitted variable bias the recommended approach is multivariate
multilevel models, which require an ‘instrumental’ variable to be identifiable.81
80 Steele, F (2011) A multilevel simultaneous equations model for within-cluster dynamic effects, with an
application to reciprocal relationships between maternal depression and child behaviour, Symposium on
Recent Advances in Methods for the Analysis of Panel Data, 15-16th June, Lisbon University Institute.
81 In attempting to estimate the causal effect of some variable x on another y, an instrument is a third variable
z which affects y only through its effect on x. See Ebbes, P, Bockenholt, U and Wedel, M (2004) Regressor and
random effects dependencies in multilevel models, Statistica Neerlandica 58, 161-178, and Steele, F, Vignoles,
A and Jenkins, A (2007) The impact of school resources on pupil attainment: a multilevel simultaneous
equation modelling approach, Journal of the Royal Statistical Society, A, 170(3), 801-824.
 Measurement error: this occurs when we want to measure a particular predictor
variable but can only measure it imperfectly. Depending on the form of this
mis-measurement, endogeneity may result. The case of multilevel models where a level
1 covariate is measured with error is considered by Ferrão and Goldstein (2009),
while Grilli and Rampichini (2011), as we discuss later, consider two sources of
endogeneity simultaneously: cluster-level omitted variables and measurement
error.82
A digression on interpreting within and between regressions
For the moment we will move away from longitudinal models and consider the standard
multilevel approach to analysing contextual effects, focusing on peer-group effects for
pupils in a class. A standard multilevel model for this is:

yij = β0 + β1xij + (uj + eij)    (41)
where yij is the current achievement of pupil i in class j and xij is past achievement. It is a
common occurrence that as we move away from a null model and include xij, not only
does the between-pupil variance decrease but so does the between-class variance.
This must mean that the predictor has an element within it that varies systematically from
class to class. That is, there are some classes where the average prior achievement is high
and others where it is low, and we would like to get at these differences to assess the size
and nature of the peer-group effect. From this perspective, the standard model is a
confusion in that the slope β1 is a mixture of the effects that are going on at the class and
the pupil level. To disentangle these we can take a level 1 predictor xij and decompose it
into two elements: the between-class means, x̄j, and the within-class deviations from
those means, (xij − x̄j).83 If both the class-mean centred predictor and the class mean
values are included in the model, we can estimate both the within- and between-influences
in a single multilevel model (Snijders and Bosker 1999, section 3.6):

yij = β0 + βw(xij − x̄j) + βb x̄j + (uj + eij)    (42)

where βw is the within-class slope effect and βb is the between-class slope effect. An
alternative formulation is sometimes used:
82 Ferrão, M E and Goldstein, H (2009) Adjusting for measurement error in the value added model: evidence
from Portugal, Quality and Quantity 43, 951-963. Grilli, L and Rampichini, C (2011) The role of sample cluster
means in multilevel models: a view on endogeneity and measurement error issues, to appear.
83 We can imagine two limiting cases: one where the class means are all the same, so that there is no between-class
variation, and at the other extreme one where there are no deviations around the class means. This can be
revealed by a two-level variance-components model where the response is xij: the two cases correspond to
σ²u = 0 and σ²e = 0 respectively. It is only in the former case that there can be no possibility of peer-group effects.
yij = β0 + βw xij + βc x̄j + (uj + eij)    (43)
where the raw predictor is included and not the class-mean centred one. The slope term
associated with the group-mean term is known as the contextual effect as it estimates the
difference the characteristics of the group make to the outcome over and above the
individual effect. Thus if βc is positive then children progress more if the children around
them are of high ability; if the coefficient is negative the children are somehow deterred if
they are in a high-ability class; sometimes it is better to be a big fish in a small pond! A
model that incorrectly assumes a common effect, as in the standard model above (equation
41), can result in a misleading assessment of the influence of a predictor on the response.84
The contextual effect is thus formally defined as the difference between the within-group
and between-group coefficients, βc = βb − βw, so that it can either be derived
directly from the contextual formulation of equation (43) or indirectly as a difference in the
group-mean centred version of (42). Similarly, βb = βw + βc. It is worth spelling out in
detail the different and specific meanings of these three coefficients:
 βw equals the estimated difference in progress for two pupils in the same class who
differ by one unit on prior attainment;
 βb equals the estimated difference in class mean achievement between two classes
that differ by one unit on the class mean of prior attainment;
 βc is the estimated difference in progress for pupils who have the same prior
attainment but attend classes that differ by one unit on prior mean attainment.
We will see later that both these specifications allow us to deal with level 2 cluster
endogeneity. For further discussion of the effects of different types of centering in
multilevel modelling see Enders and Tofighi (2007), who also consider including group means
for binary predictors, which gives of course the proportion of cases who are in the non-referent category.85
Fixed and random effects specifications
The random-intercepts linear growth model can be specified as usual as:

yij = β0 + β1x1ij + (u0j + e0ij)    (44)

u0j ~ N(0, σ²u0);    e0ij ~ N(0, σ²e0)

84 The standard multilevel model of (41) estimates a weighted average of the within- and between-group
effects, where the weight for the within-effect becomes more important when there are a large number of
level 1 observations, the level 1 residual variance is small, and the level 2 variance becomes large; the relevant
formulae are given in Rabe-Hesketh and Skrondal (2008, 114). The standard multilevel estimate will only be the
same when the between effect equals the within effect or, equivalently, when the contextual effect βc is zero.
When this is the case the common slope will be more precisely estimated as it pools the within and between information.
85 Enders, C K and Tofighi, D (2007) Centering predictors in cross-sectional multilevel models: a new look at an
old issue, Psychological Methods 12, 121-138. See also Biesanz, J C et al (2004) The role of coding time in
estimating and interpreting growth curve models, Psychological Methods 9, 30-52.
where there is a single time-varying predictor, x1ij, the age of child j on occasion i. The
key distinctive element is that the differential intercepts, one for each child, are seen as
coming from a Normal distribution with a common variance. This is a very parsimonious
model as we only have to estimate a single variance term and not hundreds or thousands of
separate terms for each child. In contrast, in the fixed-effects counterpart, which is the
dominant approach in, for example, comparative political economy and much of economics:

yij = β0 + β1x1ij + Σj αjDij + e0ij,    e0ij ~ N(0, σ²e0)    (45)

there are m-1 additional fixed effects αj associated with m-1 dummy variables Dij, one for each
child, where m is the number of units, the children at level 2.
Views on these two models can be much polarised. Thus Molenberghs and Verbeke
(2006, 47), from a biostatistics background, contend that

‘the fixed effects approach is subject to severe criticisms as it leaves several sources
of variability unaccounted for and, to worsen matters, the number of fixed effects
parameters increases with sample size, jeopardizing consistency of such
approaches’.86

In addition, the conceptual argument can be made that the random-effects model allows
generalization to the population of children and not just to the specific children as in the
fixed-effects model. Moreover, the fixed-effects model cannot include subject-level variables
(that is, time-invariant variables such as Sex and Rurality) as all the degrees of freedom have
been consumed at the child level.87 As Fielding, something of a renegade economist with
these views (2004, 4-5), writes:88

‘It is only a random effects specification that can handle level-two covariates and the
extent to which level-two covariates can explain level-two variation. It is clear that
fixed effects specifications for uj [subject-specific differences] are unsuitable for many
of the complex questions to which multilevel methodology has been addressed’.
86 Molenberghs, G and Verbeke, G (2006) Models for discrete longitudinal data, Springer, Berlin.
87 Equivalently, the child-level variables will be collinear with the set of child-level dummies, rendering the
coefficients un-identifiable.
88 Fielding, A (2004) The role of the Hausman test and whether higher level effects should be treated as random
or fixed, Multilevel Modelling Newsletter, 16(2), 3-9.
And yet Allison (2009, 2) can argue
‘such characterisations are very unhelpful in a non-experimental setting, however,
because they suggest that a random effects approach is nearly always preferable.
Nothing could be further from the truth’.89
The ‘prize’ must be very great indeed if the obvious relative advantages of random effects
are going to be spurned, and it is.
To understand the nature of the ‘prize’ it is helpful to consider an experimental
situation where the x1ij variable is a treatment which can be switched on and off and which
is randomly allocated.90 This random allocation guarantees, provided the
protocols are followed, that all other sources of systematic influence on the response
and the intervention are held at bay. Consequently, those being treated and those in the
control group not receiving the treatment are in equipoise, so that all potential
confounding factors related to the treatment and to the outcome are held off. Any observed
differences between the two groups must then be due to the intervention. Internal validity
is guaranteed to a known degree, dependent only on the size – the number of replicates – of
the experiment. Randomisation controls for confounders even if they have not, or indeed
cannot, be measured. This is the ‘prize’: by including fixed coefficients in an observational
study for each child, all child-level effects, known or unknown, have been removed. Each
individual has become their own control. Cluster-level endogeneity has been solved as all of
the variance at the child level has been wiped away. The gold-standard methodology of
random allocation can be achieved in observational studies, but only if we adopt the fixed-effects
approach. From this perspective the random-effects model is seen as seriously
deficient. Instead of a method to deal with endogeneity, the random-coefficient model is
based on the fundamental and required assumption that there is no endogeneity or,
equivalently, that omnisciently we have included all subject-level covariates that influence
the response.
In practice, to compute the fixed-effects model we could either include a set of
dummies (but this gets cumbersome with thousands of children) or get exactly the same
results by the mean-deviations method. In this procedure, we calculate for all time-varying
variables – predictors and response – the mean value for each child over time, and then
subtract this child-specific mean from the observed values of each variable and regress the
de-meaned response on the de-meaned predictors:

(yij − ȳj) = β1(x1ij − x̄1j) + (e0ij − ē0j)    (46)

89 Allison, P D (2009) Fixed Effects Regression Models, Quantitative Applications in the Social Sciences Series,
Sage, Thousand Oaks, California; similar arguments are made at http://www.scribd.com/doc/47380855/Fixed-Effects-Regression-Methods
90 The editor’s introduction to Allison (2009), referring to the fixed-effects approach, writes ‘these statistical
models perform neatly the same function as random assignment in a designed experiment’, p.ix.
We can note three things about this formulation (a worked sketch follows the list):
 We treat the child differences as a complete nuisance and calculate away the
subject-specific values, so that there are no u0j values in this equation; this means
we can estimate it as a single-level model using OLS;
 Child-level variables, that is stable time-invariant variables, will be reduced to a set
of 0’s and can no longer be modelled, as they will not vary;
 The mean difference approach sweeps out all the between-children variation and
control has been achieved for unobserved variation at the child level; the prize has
been achieved.
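Here is a minimal Python sketch of the mean-deviations (within) estimator of equation (46) applied to simulated data; all variable names and values are illustrative, not from the Madeira study.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
m, n = 200, 5                                  # children, occasions
child = np.repeat(np.arange(m), n)
u = rng.normal(0, 2, m)[child]                 # child effects, deliberately
x = 0.5 * u + rng.normal(0, 1, m * n)          # correlated with the predictor
y = 2.0 * x + u + rng.normal(0, 1, m * n)      # true within slope is 2.0

d = pd.DataFrame({"child": child, "x": x, "y": y})
# subtract each child's mean from response and predictor (equation 46)
d["x_dm"] = d["x"] - d.groupby("child")["x"].transform("mean")
d["y_dm"] = d["y"] - d.groupby("child")["y"].transform("mean")
# single-level OLS on the de-meaned data recovers the within slope
beta_w = (d["x_dm"] * d["y_dm"]).sum() / (d["x_dm"] ** 2).sum()
print(f"within (fixed-effects) slope: {beta_w:.3f}")   # close to 2.0

A pooled single-level regression of y on x in these data would be biased upwards because x is correlated with the child effects; de-meaning removes that correlation.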
Mundlak formulation
The third point is key for our argument. Instead of de-meaning to remove child
characteristics, we could also control for them by modelling them away, that is by including the
group mean in the model for each predictor that is time-varying. This is called the
Mundlak (1978) specification.91 It is exactly the same as the contextual model formulation
of earlier, so that we include the group mean in the model:

yij = β0 + βw x1ij + βc x̄1j + (u0j + e0ij)    (47)

Consequently the βw estimate will not be biased by the omission of group-level variation
associated with that variable, as it has been modelled out through βc. To stress the point we
are making: when group means are included, as in the Mundlak approach, the random-effects
model will yield the same within estimate of the slope of a time-varying variate as a fixed-effects
specification. But there is an added advantage. As we have not
expunged all child variation (the response has not been de-meaned) and we have
unexplained variation at the child level (u0j), we can include further time-invariant child
predictors to account for this variation. We can also have a much less restrictive structure
than the fixed-effects model, in that we can fit three-level models and can model explicitly
complex heterogeneity and dependence, which can be of substantive interest. However, we
do need a multilevel model with its random effects so as to ensure correct standard errors for
the fixed part estimates. The downside of course is that for each time-varying variable we
have to include an extra term – the group mean – in the model. But this is not a great deal of
trouble.92
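Continuing the simulated example above, the Mundlak specification of equation (47) can be sketched with statsmodels (an illustration only; d is the data frame from the previous sketch).

import statsmodels.formula.api as smf

# add the child mean of x as an extra predictor (equation 47)
d["x_mean"] = d.groupby("child")["x"].transform("mean")
model = smf.mixedlm("y ~ x + x_mean", data=d, groups=d["child"])
result = model.fit()
# the coefficient on x is the within (longitudinal) effect and matches the
# fixed-effects estimate; x_mean carries the contextual effect
print(result.params[["x", "x_mean"]])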
91 Mundlak, Y (1978) On the pooling of time series and cross section data, Econometrica, 46, 69-85. He could
not have been clearer (p70): ‘the whole approach which calls for a decision on the nature of the effect, whether it
is random or fixed, is both arbitrary and unnecessary’.
92 Although Allison’s (2009) monograph is called ‘Fixed Effects Regression Models’, he does realize the
power of the Mundlak approach, which he calls a ‘hybrid’ specification. While it has the intent of a fixed-effects
model in controlling for unobserved confounders, there is no disguising that it is in fact a random-effects model
with group-mean centering.
As earlier, instead of the contextual formulation, we can also use the alternative
within- and between-formulation:

yij = β0 + βw(x1ij − x̄1j) + βb x̄1j + (u0j + e0ij)    (48)

so that the group mean is included and, additionally, the time-varying predictor is de-meaned
or group centred. Consequently the within-group coefficient is given by βw, and this
measures the longitudinal effect of the predictor. In contrast, the between-group coefficient
is given by βb, and this measures the cross-sectional effect of the predictor. As usual
there could be contradictory processes going on at each level which may be of substantive
interest; this would be lost altogether in the fixed-effects approach. What we had
previously discussed as a method to disentangle within and between effects in a random-coefficient
model is now seen as a solution to level 2 endogeneity. Technically, the group
mean x̄1j is an instrumental variable for x1ij, as it is correlated with x1ij but uncorrelated
with the random intercept u0j. We can also view this as the random intercept being
uncorrelated with the group-mean centred predictor (x1ij − x̄1j).
When group means are included, the within estimate of the random-effects model will yield
the same estimated effect as a fixed specification. From this perspective the fixed-effects
approach is deficient as it disregards the cross-sectional variance of all the variables in the
model and only uses variation over time to estimate the parameters of the model. While the
fixed estimator is the dominant approach in econometrics and comparative political
economy, it is seldom used by educational researchers.93 This is because it wipes out the
variables of key focal interest that lie at the higher level. So in studying pupils in classes in
schools, if we included pupil fixed effects we would not be able to estimate peer-group
effects, teaching styles or school climate. In the longitudinal situation if we used fixed
effects, we would not be able to study the effect of time-invariant variables or their
interactions, so that we cannot estimate and study even quite fundamental aspects like
differential growth rates for boys and girls. This can also be an issue when the variable of
causal interest is nearly time-invariant, as is often found when longitudinal models are used
to estimate the effect of changes at the country level; conceptually they are time-varying
predictors but in practice the changes are only slight and may be confined to only a few
units. Such slow-moving variables include the presence of democracy at the country level or
the presence of a particular institution in a State over time in US politics. If this is the case,
and the variables of interest show much more variation between units than within, the
fixed-effects estimator will be very inefficient and considerable efforts have been made to
get improved fixed estimates (Plümper and Troeger, 2007).94 But as Shor et al (2007) found
in their Monte Carlo evaluation of MCMC estimation of time-series cross-section data,
random-effects models are able to recover the effects of such partially time-invariant
predictors that are slow moving.95

93 As recently as 2007, Beck and Katz were able to identify only two applications of the random-coefficient
model to time-series cross-section data in comparative political economy, both of them by the same
author, and neither included time as a predictor! Their paper goes on to show in a set of simulations how
effective the model is for estimating the overall slope and country-specific values even when there is non-normality
of between-country differences and outliers are imposed. Moreover they find that the approach will
not mislead, so if there is no country heterogeneity present, the random-effects model will not falsely find it.
The simulation was designed to examine a typical comparative political economy study with 20 countries and 5
to 50 occasions, with the models being estimated by maximum likelihood. With a larger number of higher-level
units and MCMC estimation that takes account of the uncertainty in both the fixed and random parameters,
even better results could be anticipated. This was indeed found by Shor et al (2007) in their Monte Carlo
evaluation of MCMC estimation of time-series cross-section data. Beck, N and Katz, J N (2007) Random
coefficient models for time-series cross-section data: Monte Carlo experiments, Political Analysis, 15, 182-195.
Shor, B, Bafumi, J, Keele, L and Park, D (2007) A Bayesian multilevel modelling approach to time-series
cross-sectional data, Political Analysis 15, 165-181.
Finally it is worth stressing that the approach advocated here deals with only one
form of endogeneity bias: the correlation that arises between included occasion-level variables
and omitted child characteristics, that is cross-level or level 2 endogeneity between x1ij and
u0j. It does not protect from same-level endogeneity, such as the bias from correlation
between included child-level variables and omitted child characteristics (between u0j and x̄1j
or other level 2 predictors). Thus the coefficients for other level 2 variables such as gender
may be subject to bias, as we are not controlling for unmeasured predictors at level 2. Nor
does it protect from correlations between included variables and level 1 residuals. These
would usually require the approach of instrumental variables or simultaneous equations;
see Ebbes et al (2004) and Kim and Frees (2007).96
The Hausman test
Since the 1980s most applications of panel data analysis have made the choice between
random and fixed effects based on Hausman’s (1978) test.97 A fixed-effects model is fitted
and then an equivalent random-effects specification is estimated. The difference between
the two results is the test statistic, and if the differences are significant, cluster-level
endogeneity is deemed to be present and the random effects are abandoned in favour of
the supposedly more robust-to-endogeneity fixed effects. In fact the Hausman test, which
really is a very general test that can be applied to a wide variety of different model
misspecifications, is in the longitudinal case (Baltagi, 2005, Section 4.3) exactly equivalent to
testing that the within- and between-estimates are different or, equivalently, that the
contextual effect is zero.98 In this light, the Hausman test should not be seen as just a
94 Plümper, T and Troeger, V (2007) Efficient estimation of time-invariant and rarely changing variables in finite
sample panel analyses with unit fixed effects, Political Analysis, 15, 124-139.
95 Shor, B, Bafumi, J, Keele, L and Park, D (2007) A Bayesian multilevel modelling approach to time-series
cross-sectional data, Political Analysis 15, 165-181.
96 Ebbes, P, Bockenholt, U and Wedel, M (2004) Regressor and random effects dependencies in multilevel
models, Statistica Neerlandica 58, 161-178; they recognize 16 types of ‘correlation’ between included level 1
and level 2 variables and level 1 and level 2 residuals; only one of these is no endogeneity bias, and only one of
them is tackled by the Mundlak specification. Kim, J-S and Frees, E W (2007) Multilevel modelling with
correlated effects, Psychometrika, 72, 505-533.
97 Hausman, J A (1978) Specification tests in econometrics, Econometrica, 46(6), 1251-1271.
98 Baltagi, B H (2005) Econometric analysis of panel data, Wiley, Chichester.
technical decision but as one of very substantive importance – a significant result means that
there is evidence that different processes are operating longitudinally from those that are
operating cross-sectionally. To take an example, life satisfaction may not just be affected by
the change to unemployment (the longitudinal effect) but also by the proportion of time spent in
unemployment. Methodologically, a significant contextual effect is a warning not to commit
the atomistic or ecological fallacy: a cross-sectional analysis would miss important
longitudinal processes, while a purely longitudinal analysis would miss the importance of
stable or enduring effects. Moreover, its standard use in choosing one formulation over
another is beside the point, as the group-mean centred formulation achieves the fixed-effects
goals in a much more parsimonious way.99 Given the very widespread application of
the Hausman test as a purely technical arbiter, it is very clear there is a great deal of
confusion about the relative merits of the fixed and random approach.
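Because the Hausman test here is equivalent to testing that the contextual effect is zero, it can be sketched as a simple Wald test on the Mundlak fit from the earlier sketch (an illustration only, not the classical Hausman statistic).

from scipy import stats

# reusing 'result' from the Mundlak sketch: the coefficient on x_mean is
# the contextual effect (beta_b minus beta_w); testing it against zero is
# equivalent to the Hausman test in this setting
z = result.params["x_mean"] / result.bse["x_mean"]
p = 2 * (1 - stats.norm.cdf(abs(z)))
print(f"contextual effect z = {z:.2f}, p = {p:.4f}")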
Hanchane and Mostafa (2010) provide a lovely illustration of these arguments, albeit in a non-longitudinal setting.100 They examine the level-2 endogeneity that arises from correlations between student characteristics and omitted school variables. They argue that the cause of this is stratification, such that poor households are likely to live in poor communities due to the functioning of the housing market, so that local schools do not have a random allocation of children; rather, entry will be selective and non-homogeneous. Apparent school effects may then be a by-product of the social mix of their pupils. Their new twist in the argument is that this may differ by the organisation of the prevailing education system; being lessened in comprehensive Finland, heightened in Germany with its early selection, and in-between with the English liberal educational management system. Consequently the degree of endogeneity will vary from system to system. They then evaluate this by analysing the international Pisa data with student mathematics scores in the 2003 survey as their dependent variable. They find that if they omit the school peer effects (the group means of the Mundlak formulation), the Hausman test does find endogeneity in the manner specified, so Finland has the lowest value and Germany the highest. Moreover, including the group means results in the Hausman test statistic becoming zero in all three countries.
Exemplifying the Mundlak approach
Neuhaus and Kalbfleisch (1998) report a two-level multilevel model of birth weights (in grams) for 880 women (j) who have each had 5 births (i); the predictor is the Age of the
99
Snijders, T A B and Berkhof, J (2008) Diagnostic checks for multilevel models, in de Leeuw, J and Meijer, E (eds.) Handbook of Multilevel Analysis, Springer, New York, describe the Hausman test (p147) as 'slightly beside the point'.
100
Hanchane, S and Mostafa, T (2010) Endogeneity problems in multilevel estimation of education production functions: an analysis using Pisa data, LLAKES Research Paper 14 (http://www.llakes.org).
mother at the birth.101 They report a standard two-level compound symmetry model that presumes that the within- and between-results are identical:

$$y_{ij} = \beta_0 + \beta_1\,\text{Age}_{ij} + (u_j + e_{ij})$$

and a model with group-mean centred Age and mean Age:

$$y_{ij} = \beta_0 + \beta_1(\text{Age}_{ij} - \overline{\text{Age}}_j) + \beta_2\,\overline{\text{Age}}_j + (u_j + e_{ij})$$
It is clear that the three slope estimates are very different: the slope in the standard model, at 17.34, is an un-interpretable amalgam of the slope of 11.83 when Age is group-mean centred and 30.35 for the slope for mean Age. Both the terms in the latter model have a clear interpretation. The within or longitudinal effect shows that as a given woman ages by one year her children's birth weight increases by an average of 11.83 grams. The between or cross-sectional effect estimates that average birth weight will differ by 30.35 grams between two women whose average age at birth differs by one year. This coefficient estimates differences in birth weight for women who had children at different periods of their lives (the median age at first child was only 17). A Wald test of the difference in the between- and within-slopes (equivalent to a Hausman test) is highly significant, but that should not mean that we abandon the random-coefficient model for the fixed-effects procedure: we already have the within, longitudinal effect corrected for level-2 endogeneity, so instead we pay attention to the substantive interpretation of the two processes at work.
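The 'amalgam' can be made precise with a standard result for random-effects estimators (not derived in this form by Neuhaus and Kalbfleisch, but useful here): the pooled slope is a weighted combination of the within and between estimators,

$$\hat\beta_{\text{standard}} = w\,\hat\beta_{\text{within}} + (1 - w)\,\hat\beta_{\text{between}}, \qquad 0 \le w \le 1,$$

where the weight w depends on the variance components and the relative within- and between-cluster variation in Age. Consistent with this, 17.34 lies between 11.83 and 30.35, but the weight is a by-product of the design rather than anything substantive.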
Measurement error of the cluster means
The inclusion of the level-2 cluster means in the Mundlak formulation is of course based on the assumption that this is the true mean, whereas in practice it is a sample-based value which may be based on relatively few observations. Ignoring this results in a measurement-error problem in which the contextual effect will be attenuated and the unexplained level-2 variance inflated, and there is the possibility of biased estimates for other cluster-level variables. The within-slope estimate remains unbiased, as does the level-1 between-occasion variance. We have resolved one form of level-2 endogeneity only to be confronted by another. Grilli and Rampichini (2011) show how it is possible to correct for this measurement error post-estimation by taking account of the reliability of the sample group mean.102 They estimate this from another multilevel model where the outcome is the predictor of interest, $x_{ij}$. Thus
101
Neuhaus, J M and Kalbfleisch, J M (1998) Between and within cluster covariate effects in the analysis of clustered data, Biometrics, 54, 638-645.
102
Grilli, L and Rampichini, C (2011) The role of sample cluster means in multilevel models: a view on endogeneity and measurement error issues, to appear.
$$x_{ij} = \beta_0 + u_j + e_{ij}; \qquad u_j \sim N(0, \sigma_u^2), \quad e_{ij} \sim N(0, \sigma_e^2) \qquad (49)$$

The reliability of the group means in a balanced model with n level-1 units per cluster is then

$$\lambda = \frac{\sigma_u^2}{\sigma_u^2 + \sigma_e^2/n} \qquad (50)$$

so that in a typical longitudinal setting the level-2 variance would be equal to the level-1 variance and, with n equal to 5 occasions, the reliability would be 0.83, whereas if there were only 2 occasions reliability would fall to 0.66. Their method is to inflate the estimated attenuated contextual effect by dividing by this reliability:

$$\hat\beta_2^{\,corrected} = \hat\beta_2^{\,attenuated}/\lambda \qquad (51)$$
When the data are unbalanced they recommend computing the reliability for each cluster and averaging it. A modified procedure is needed when the level-1 units are a finite sample from the higher-level units (there are only so many children in the class), for as the sampling fraction (n/N) increases, so reliability improves. They suggest that you need more than 30 clusters for this approach to be effective. It is probably best to regard this very easy-to-use procedure as a warning device, alerting us to potential attenuation in a particular analysis.
We can then, if the problem is found to be severe, adopt the more complex procedure developed by Shin and Raudenbush (2010). They tackle this issue by treating the group mean as a latent variable. In its simplest form this involves replacing the group mean by the shrunken level-2 residual from the above variance components model (equation 49) where $x_{ij}$ is the response. They go on to develop a more flexible formulation in a multivariate multilevel framework that additionally deals with the situation when the sample group mean is based in part on missing data.103
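The post-estimation correction of equations (49)-(51) is easy to script; here is a small sketch (an assumed helper, not code from Grilli and Rampichini) that computes the reliability of each sample cluster mean and inflates the attenuated contextual effect.

import numpy as np

def reliability(sigma2_u, sigma2_e, n):
    """Reliability of a cluster mean based on n level-1 units (equation 50)."""
    return sigma2_u / (sigma2_u + sigma2_e / np.asarray(n, dtype=float))

# balanced longitudinal example from the text: equal variances, 5 occasions
print(reliability(1.0, 1.0, 5))   # 0.833...
print(reliability(1.0, 1.0, 2))   # 0.666...

# unbalanced data: compute per-cluster reliabilities and average (their advice)
n_j = np.array([2, 3, 5, 5, 4])                # hypothetical cluster sizes
lam = reliability(1.0, 1.0, n_j).mean()
beta2_corrected = 0.36 / lam                   # equation (51), illustrative value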
Re-analysing the Madeira growth data
To compare a full range of models we fitted four different specifications:
1. a fixed-effects linear growth model, fitted by OLS with 299 dummies; individual Age is grand-mean centred;
2. a random-intercepts two-level model fitted by RIGLS; individual Age is grand-mean centred; this is a random-effects equivalent of 1;
3. a random-intercepts two-level model fitted by RIGLS with the contextual formulation; group-mean Age is included as a grand-mean centred variable and individual Age is also included as a grand-mean centred variable;
103
Shin, Y and Raudenbush, S W (2010) A latent cluster-mean approach to the contextual effects model with missing data, Journal of Educational and Behavioural Statistics, 35(1), 26-53.
4. a random-intercepts two-level model fitted by RIGLS with the within- and between-formulation; group-mean Age is included as a grand-mean centred variable, but individual Age is included as a group-mean centred variable.
The grand-mean centring of the group Age variable is adopted to get more interpretable intercepts and to allow more direct comparison between models.
Here are the specifications in MLwiN:
Model 1 is a fixed-effects analysis with Child 1 as the reference category; to achieve this, Child must be toggled to categorical.104
Model 2: equivalent to Model 1 but with a random-intercepts formulation and grand-mean centred individual Age
104
It is possible in MLwiN to fit all 300 dummies in the fixed part and to include a level 1 differential based on
the overall Constant.
Model 3: contextual formulation; group-mean Age is included as a grand-mean centred variable and individual Age is also included as a grand-mean centred variable
Data manipulation on main menu
Multilevel data manipulations
Operation: Average
On blocks defined by: Child
Input columns: Age
Output columns: a free column; here c28 is chosen
Add to action list
Execute
Naming c28 as ChildAvAge, add it to the model and then centre it around the grand mean to aid the interpretation of the overall intercept.
Model 4: the within- and between-formulation with group-mean centring of Age, achieved by modifying the term for individual Age and choosing group-mean centring on the Child index variable
to get the following specification
The results of the four models are shown in Table 10.

Table 10 Estimates of alternative specifications

                      1: Fixed        2: Random       3: Contextual   4: Within-between
                      Est.    S.E.    Est.    S.E.    Est.    S.E.    Est.    S.E.
Fixed part
 Constant               -       -     17.84   0.21    17.84   0.21    17.84   0.21
 (Age-gm)             0.63    0.02     0.63   0.02     0.63   0.02      -       -
 (Age-(Child))          -       -       -      -        -      -      0.63    0.02
 (ChildAvAge-gm)        -       -       -      -      0.36    0.40    0.99    0.40
Random part
 Level 2: Child         -       -     12.36   1.10    12.36   1.10    12.36   1.10
 Level 1: Occasion    5.09    0.19     5.09   0.22     5.09   0.22    5.09    0.22
Deviance            6044.47          7099.57         7098.77         7098.77
Comparing the fixed estimate of the individual Age term in Models 1 and 2, the slope is exactly the same, 0.63, whether a fixed or a standard random-effects model is used. There is therefore no evidence of cluster-level endogeneity. When the contextual Model 3 is fitted with Child average Age, the within-child Age effect of 0.63 is exactly the same as in the standard multilevel model, because the contextual effect of 0.36 is not significant, as a Wald test shows.
cpro 0.794 1
0.37289

This test is exactly equivalent to a Hausman test; again there is no evidence of endogeneity. This is also confirmed by the more reliable LRT of the change in deviance between Models 2 and 3:

calc b1 = 7099.57-7098.77
0.80000
cpro b1 1
0.37109

Finally, we can test the difference between the size of the between and within effects of Age: the difference of 0.358, the size of the contextual effect, is far from significantly different from zero at conventional levels (the same Wald chi-square of 0.794 applies in the within-between parameterisation of Model 4).
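The CPRO checks above can also be reproduced outside MLwiN; here is a sketch using scipy (illustrative only, not part of MLwiN's macro language), where chi2.sf gives the upper-tail chi-square probability:

from scipy import stats

print(stats.chi2.sf(0.80, df=1))    # LRT: deviance change of 0.80 on 1 df -> ~0.371
print(stats.chi2.sf(0.794, df=1))   # Wald test of the contextual effect -> ~0.373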
Overall there is no evidence of endogeneity and the standard multilevel specification of the fixed part that we have been using until this section is supported. You may like to reflect on why this is the case. If you undertake a variance components model with individual Age as the response, the answer is clear: there is no substantive variance between children; all the variance around the overall average age of 11.4 is between occasions. The design of the study is such that we have followed a single cohort, aged 7 at the start, for 8 years, so that we only have data to detect individual change; the baseline cross-sectional effect in terms of age is roughly the same for all children. In the next chapter we will see how it is possible to analyse a design that does allow us to separate these elements; indeed they are the focus of the study, for the real Madeira Growth Study allows us to look at cohort change.
What we have learnt
• The random coefficient model is a highly flexible procedure for modelling growth and development. It allows for heterogeneity and serial dependence in a parsimonious fashion.
• The conditional formulation is more fundamental than the marginal, as it is possible to estimate the former and derive the latter; in the discrete response case, different answers can be obtained but these relate to the different questions that are being asked. MLwiN, through the customised predictions window, can provide both; they will not differ a great deal unless the higher-level variance is substantial.
• Subject-level confounding can be tackled through the Mundlak specification by including subject means of time-varying covariates, with the considerable advantage that the effects of between-subject covariates can still be estimated. Other forms of endogeneity need to be tackled by instrumental variables, although finding 'good instruments' (variables that have no direct effect on the response and account for a substantial proportion of the variation in the predictors) is always difficult in substantive work. That is why in epidemiological research there has been recent interest in Mendelian randomization.105
• The Hausman test does not provide a reason to prefer fixed effects over random effects, but really assesses the extent to which cross-sectional, between estimates differ from longitudinal, within ones.
• When sample group means are used, the contextual effect can be attenuated when the sample size within a cluster is relatively small.
105
Davey Smith, G and Ebrahim, S (2005) What can Mendelian randomisation tell us about modifiable behavioural and environmental exposures? British Medical Journal, 330, 1076-1079; Didelez, V and Sheehan, N (2007) Mendelian randomization as an instrumental variable approach to causal inference, Statistical Methods in Medical Research, 16, 309-330.
Answers to Questions
Question 1: obtain a plot of Endurance against Age separately for men and women (hint: use Col codes on the graph); what do you find?
Both sexes show some general increase over time, but this is more marked for the Boys than the Girls; both sexes show substantial between-child differences at all ages.
Question 2: use the Tabulate command to examine the means for Endurance for the cross-tabulation of Sex and Rurality; what do you find?

Variable tabulated is Endur
Columns are levels of Sex; rows are levels of Rurality (0 = Urban, 1 = Rural)

                   Boy                  Girl                 TOTALS
                 N    MEAN   SD       N    MEAN   SD       N    MEAN   SD
 0 (Urban)      314   18.2  4.27     309   15.2  3.59     623   16.7  3.95
 1 (Rural)      390   21.2  4.53     408   16.2  3.27     798   18.7  3.93
 TOTALS         704   19.9  4.41     717   15.8  3.41    1421   17.8  3.94

Sex differences in the means are greater than Rurality differences, but the Rurality difference is greater for Boys than for Girls; rural Boys have the highest mean.
Question 3: are there any distinctive patterns of missingness in terms of Rurality and Occasion?

Command: TABUlate 14 'Missing' 'Rurality'

Columns are levels of Missing; rows are levels of Rurality

                    Not      Yes     TOTALS
 Urban   N          623       32        655
         ROW %     95.1      4.9      100.0
         CHI       0.10    -0.43
 Rural   N          798       47        845
         ROW %     94.4      5.6      100.0
         CHI      -0.09     0.37
 TOTALS            1421       79       1500
         ROW %     94.7      5.3      100.0

(CHI = signed sqrt(chi-squared contribution))
Chi-squared = 0.34 on 1 df

The Urban and Rural areas both have around 5 per cent missing and there is not a significant difference between them, as shown by the low chi-square value.
TABUlate 14 'Missing' 'Occasion'

Columns are levels of Missing; rows are levels of Occasion

                    Not      Yes     TOTALS
 Occ1r   N          300        0        300
         ROW %    100.0      0.0      100.0
         CHI       0.94    -3.97
 Occ2r   N          284       16        300
         ROW %     94.7      5.3      100.0
         CHI      -0.01     0.05
 Occ3r   N          274       26        300
         ROW %     91.3      8.7      100.0
         CHI      -0.61     2.57
 Occ4r   N          276       24        300
         ROW %     92.0      8.0      100.0
         CHI      -0.49     2.06
 Occ5r   N          287       13        300
         ROW %     95.7      4.3      100.0
         CHI       0.17    -0.70
 TOTALS            1421       79       1500
         ROW %     94.7      5.3      100.0

(CHI = signed sqrt(chi-squared contribution))
Chi-squared = 28.65 on 4 df

There is a significant difference between occasions, but this is largely driven by the lack of missingness at occasion 1; indeed, occasion 5 has lower missingness than occasions 2 to 4.
Question 4: why does Age-gm have a ij subscript and not a j subscript?
It is an occasion-varying variable.
Question 5: repeat the above procedure to see if a cubic term is necessary
This requires modifying the Age term and choosing a polynomial of degree 3. The results of the converged model show that the coefficient of the cubic term is not large in comparison to its standard error (0.006 compared to 0.004). A Wald test (Intervals and Tests window) elicited a chi-square value of 2.513 with 1 degree of freedom and hence a p value of 0.11. In the interests of parsimony, we decided not to keep the term nor store the results of this model.
Question 6: why does Girl have a j subscript and not an ij subscript?
It is measured only at the child level; it is a time-invariant variable.
Question 7: repeat the above procedure, building on Model 5, to see if a Rural main effect is needed and if there is an interaction between Rurality and Age – choose urban as the base
The question asks whether the Rurality effect on Endurance changes with Age. The main-effect Model 6 with added Rurality shows that children from rural areas have a higher endurance: at average age the difference is 2.03 hundreds of metres. This is significant at conventional levels and it is smaller than the Boy-Girl difference (-3.63) at this age. Note the Rural dummy has a j subscript, reflecting a time-invariant variable; Rurality was measured at occasion 1.
Model 7 includes the added first-order interaction between Rurality and the second-order polynomial of Age. The two new terms are not large in relation to their standard errors, so there is no strong evidence that Rurality has a differential effect as children age. The joint Wald test with 2 degrees of freedom on a chi-square of 2.712 returns a p value of 0.258, while a likelihood ratio test of the two models returns a chi-square difference of (6811.767 - 6809.06), which equals 2.707 and returns a p value of 0.258 with 2 degrees of freedom consumed by the more complex second model. Again there is no evidence of the Rural effect changing with Age. The size of the effects is again readily appreciated through the customised predictions. The wish list now consists of 2 Sexes by 2 Rural/Urban groups and 11 Age groups (7 to 17); here is an extract of the predictions.
To get a plot of all the effects, the additional requirement is to use the Rurality predictions to Trellis in the Y dimension so that the graphs are in different columns.
Given the non-importance of the Age*Rurality interactions, we remove them from the model in the interests of parsimony.
Question 8: what would happen to the standard errors of the fixed part if it was assumed there was no dependence?
The current model is a multilevel random-intercepts model; a model assuming independence requires that the level-2 random part is removed. Here is the revised model (which can be estimated by OLS) and a comparison of the estimates.
estimated by OLS) and a comparison of the estimates.
ML
Estimate
Fixed Part
Constant
(Age-gm)^1
(Age-gm)^2
Girl
(Age-gm)^1.Girl
(Age-gm)^2.Girl
Rural
Girl.Rural
Random Part
Level: Child
Constant/Constant
Level: Occasion
Constant/Constant
-2*loglikelihood:
ML
S.E.
18.11
0.86
0.01
-2.47
-0.44
-0.06
3.06
-2.04
0.356
0.028
0.011
0.507
0.040
0.016
0.462
0.652
6.88
0.642
4.50
6802.10
0.190
- 223 -
OLS
Estimate
OLS
S.E.
17.97
0.89
0.02
-2.32
-0.48
-0.07
3.09
-2.08
0.238
0.044
0.017
0.341
0.062
0.025
0.256
0.362
11.44
7495.61
0.429
Ratio of
OLS to ML
SE
*
0.67
1.57
1.55
0.67
1.55
1.56
0.55
0.56
The standard errors of the time-invariant estimates are deflated; that is, they are spuriously precise when the independence assumption is made. Thus the SE associated with Girl is estimated to be 0.341 and not 0.507. In contrast, the SEs of the time-varying parameters are too imprecise when the independence assumption is made. Also notice, according to the deviance, that the OLS model is a very substantially worse fit. There is very strong evidence of the need to model dependency.
Question 9: make a plot of the modelled growth trajectories for Urban Boys using the above random slopes model; what does it show?
First make the predictions for the base category of male urban children, including the random intercepts and slopes at level 2; do not include the level-1 random terms.
To obtain a plot of the modelled growth trajectories:
• Graphs on Main menu
• Customised graphs
• Choose c50 for Y on the 'plot what?' tab
• Choose Age for X
• Choose Plot type to be a Line
• Choose Group to be Child
• Apply
The scale of the between-child heterogeneity is again apparent. Visually there is some evidence of increasing between-child heterogeneity as the children develop, but in truth there is not a great deal of departure from the parallel lines of the random-intercepts assumption that variance does not change with Age.
Question 10: what do you conclude from these results; has the monitoring chain been run for a sufficiently long length; are there differences between the results obtained from the two types of estimation?
The Effective Sample Size for the most dependent chain is equivalent to 470 independent draws, which suggests that the MCMC has been run sufficiently long. To check, we can examine the trajectory and diagnostics for this parameter. There is no evidence of trending, so it looks as if this 50k monitoring is sufficient even for this parameter, which has a great deal of imprecision (the mean of the estimate is about the same size as its standard error). A detailed examination of the two different sets of estimates finds that the results of both models are exceedingly similar and we would not reach different conclusions. The deviance from the RIGLS model cannot be compared with the DIC of the MCMC model.
Question 11: what do you conclude from these results; has the monitoring chain been run for a sufficiently long length; are there differences between the results obtained from the two types of estimation? Is the three-level model an improvement on the two-level model?
The Effective Sample Size for the most dependent chain is equivalent to 471 independent draws, which suggests that the MCMC has been run sufficiently long. Despite there being only some 30 schools, the ESS for the between-school variance is large at 3812. If we examine the trajectory and diagnostics for this parameter, we see a markedly skewed distribution: while the mode is 1.196, the lower 2.5% quantile is 0.467 and the upper 97.5% quantile is 2.729. There does look to be evidence of a school effect, but the evidence is not overwhelming, as there is some support in the smoothed histogram for the value being zero. A detailed examination of the two different sets of estimates finds that the results of both models are very similar and we would not reach different conclusions. We saw earlier that the deviance from the RIGLS three-level model was significantly lower than that of the two-level model, suggesting that there are genuine between-school differences. If we compare the DIC of the two models (6429.809 against 6432.12) there is some evidence of differences between schools, but such a difference does not bring overwhelming evidential support.
Question 12: what are the (approximate) characteristics of these orthogonal
polynomials?
They have approximately a mean of zero, the same standard deviation and a correlation of zero (meaning that the relative sizes of the effects can be compared); the differences from these values are due to imbalance in the data structure and rounding error.
Question 13: what are the differences between the fixed estimates and their standard errors in the two models? Is the more complicated unstructured model a significantly better fit to the data?
There are no differences of substance between the two models in either their fixed part estimates or their associated standard errors. The difference in the deviance is

calc b1 = 6802.133 - 6785.074
17.059

The difference in the degrees of freedom is due solely to the nature of the random part; there are 2 estimated parameters in the CS specification and 15 estimable parameters in the UN specification, a difference of 13 more terms.

cpro b1 13
0.19662

This, even when halved, does not provide convincing support that we need the considerably more complex unstructured model.
Question 14: is the Toeplitz model a significantly better fit than the compound symmetry model? What are the implications of this result?
The difference in the deviance is

calc b1 = 6802.133 - 6801.760
0.37300

while the difference in the degrees of freedom is due solely to the nature of the random part; there are 2 estimated parameters in the CS specification and 5 estimable parameters (the number of occasions) in the Toeplitz specification, a difference of 3 more terms.

cpro b1 3
0.94576

With such a high p value, even when halved, we should in the interests of parsimony prefer the simpler model. There is little evidence that a more complicated dependency structure than compound symmetry is needed.
Question 15: have the models been run for long enough? Do the estimates for pD make sense? Which is the preferred model if the DIC criterion is used?
The ESS of the parameters for all the models suggests that 50k is a sufficient monitoring length, and an examination of some individual trajectories suggests that there is no problematic trending. The pD values have good face validity, with the homogeneous compound symmetry model consuming the fewest degrees of freedom and the unstructured model the most. Moreover, each of the homogeneous versions of the models is correctly identified as having the simpler form, with fewer degrees of freedom consumed in the fit. The 'best' model in terms of its DIC (the ability to predict a replicate dataset which has the same structure as that currently observed) is the homogeneous compound symmetry model. Both forms of the autoregressive model are a substantially worse fit. There is little to choose between the homogeneous and heterogeneous compound symmetry models. Overall there is no evidence to prefer any model other than the simpler homogeneous compound symmetry.
Question 16: make a plot of the predictions from the fixed part estimates of the results for boys and girls in urban and rural areas at different stages of development. What do the results show?
Prediction of the fixed part is put into c50 (note that standard errors are not implemented in MCMC models).
Make a composite indicator of Sex and Rurality of the correct length:

CALCulate c51 = 'Girl.12345' + (2* 'Rural.12345')
TABUlate 0 'c51'

Toggle categorical for c51 and edit the labels to be 0: Urban boy; 1: Urban girl; 2: Rural boy; 3: Rural girl.
Make a new variable for chronological age (this also has to be of the correct length):

CALCulate c52 = '(Age1-gm).12345' + 11.493
Use customised plots and plot the graph. Boys have greater endurance than girls over the whole age range and the difference is greatest after the age of eleven, as the girls' development begins to level off while the boys continue to improve. Rural children have the greater endurance and, while this is true for both boys and girls, the greater Rurality difference is for boys. The greatest endurance is found among rural boys.
15. Modelling longitudinal and cross-sectional effects
Introduction
This chapter is an extension of the last in that it extends longitudinal analysis to deal simultaneously with age and cohort effects. The argument is made by analysing two case studies. The first is an extension of the Madeira Growth Study, whereby its accelerated design allows us to estimate cohort effects over time in addition to variation within and between children. The second considers changing gender ideology in the UK using the British Household Panel Survey. This is a repeated measures design and we will simultaneously model longitudinal and cross-sectional effects. A short section at the end will consider recent developments in the analysis of age, cohort and period. We begin by defining what we mean by age, period and cohort and consider the capacity of different designs to isolate empirically these different conceptual effects.
Age, cohort and period in the Madeira Growth Study106
In studying change and development in human capability, it is possible to recognize three separate components:
• Age: this is the capability associated with how old the individual is, and how capacity develops as individuals mature; age effects are internal to individuals;
• Cohort: this is a group who share the same events within the same time interval; cohort effects are associated with changes across groups of individuals who experience common events. Cohort effects arise from the systematic differences in the study population that are associated with age at enrolment;
• Period: this is the specific time at which the level of achievement or performance is measured; period effects are external to the individuals and represent variation over time periods regardless of age and of cohort.
The focus here is on birth cohorts born in the 1980s who have lived through Madeira's economic transformation. The case for the centrality of cohorts to social change was made by Ryder (1965).107 A birth cohort moves through life together and encounters the same historical and social events at the same ages; in this way they are changed and society is also changed via replacement.
106
This section was written with Duarte Freitas, University of Madeira.
107
Ryder, N B (1965) The cohort as a concept in the study of social change, American Sociological Review, 30, 843-61.
Alternative designs for studying change
While growth researchers emphasize the maturation effects of aging, the observed changes may also be a result of the child's year of birth – the cohort – and the actual year the observation was made – the period. The three terms are of course linked by the identity

Age = Period - Cohort

so that a child aged 12 will have been born in 1980 if they have been measured in 1992. The ability to disentangle the effects of these different time scales depends on the research design that is used.

A cross-sectional study only provides data for just one period (everybody is measured at the same time), so that it is impossible to estimate the effect of period, which is held constant. Moreover, at a fixed point in time, age and birth cohort differences are confounded; it is not empirically possible to disentangle age changes and cohort variations with a cross-sectional design. If we were to find in a cross-sectional study that older children have greater aerobic performance, we cannot tell whether this is due to the maturation effect associated with age or to a cohort effect, such that this older group, born into an earlier cohort, has always had a better performance. A longitudinal design is one in which children are measured on more than one occasion. Consequently, the growth rate is directly observed for each child. But a pure longitudinal study, as in the previous chapter, that follows a particular cohort is also limited. The cohort has been kept constant by sampling, so any change could be explained either by age or by period. Moreover, in a single cohort, age and cohort are again confounded, so that the improvement in aerobic performance may not be the natural individual process of maturation, but specific to this cohort as they collectively age.

Consequently we need a longitudinal study that follows multiple cohorts. The most efficient approach (Cook and Ware, 1983) is an accelerated longitudinal design (mixed design or cohort sequential design).108 Distinct, sequential age cohorts are sampled and longitudinal data are collected on members of each cohort. It is accelerated as it allows the study of development over a long age span in a short duration. This design usually employs an overlap between the cohorts, so that for each cohort there is at least one age at which one of the other cohorts is also observed. This overlap allows the splicing together of one overall trajectory of development from the growth segments obtained for each cohort.

This design has a number of important advantages. As it is of short duration, there is a shorter wait for findings, and less opportunity for sample attrition to accumulate. There is also less time for the measurement team to be in the field, which reduces costs. Moreover, as pointed out by Schaie (1965), age, period and cohort are potentially un-confounded in this design, so that we can compare the development of, say, 12 to 14 year olds in more than one cohort.109 The downside of the accelerated design is that the data are sparser at the earliest and latest ages where there is no overlap. The design is also problematic when there is substantial migration, as this process can erroneously produce cohort effects. This is not a problem, however, if attrition is kept low.
108
Cook, N R and Ware, J H (1983) Design and analysis methods for longitudinal research, Annual Review of Public Health, 4, 1-23.
The raison d'être of a longitudinal study is to characterise within-individual changes in the outcome over time. The primary aim of the accelerated design has been to do this efficiently. Bell (1953) recognized that the piecing together of cohort-specific growth curves is only legitimate if there is 'convergence', that is, no significant differences among cohorts in the age-specific means where the cohorts overlap.110 For him and many others, cohort effects are a threat. Consequently there are tests to guard against debilitating cohort effects (Miyazaki and Raudenbush, 2000; Fitzmaurice et al, 2004, 306-309), and comparisons to see if an accelerated design can recover the trajectories that would have been produced by a 'proper' long-duration study with a single age cohort (Duncan et al, 1996).111 The perspective adopted here is radically different, in that the accelerated design is seen as an opportunity to assess the size and nature of social change through studying overlapping cohorts. Although it may be conceptually possible to separate all three elements in an accelerated design, we focus on age and cohort only here and ignore period completely. That is, we follow Guire and Kowalski (1979), who argue that although stability over time is usually not true for sociological studies, it is often true for studies of physical growth, and that the short-term time effect can safely be assumed to be zero.112
Table 11: The accelerated longitudinal design structure of the Madeira Growth Study

 Cohort (year born)    School grade when measured
 1980                  16   17   18
 1982                  14   15   16
 1984                  12   13   14
 1986                  10   11   12
 1988                   8    9   10
109
Schaie, K W (1965) General model for the study of developmental problems, Psychological Bulletin, 64, 92-107.
110
Bell, R Q (1953) Convergence: an accelerated longitudinal approach, Child Development, 24, 145-52.
111
Miyazaki, Y and Raudenbush, S W (2000) Tests for linkage of multiple cohorts in an accelerated longitudinal design, Psychological Methods, 5, 44-63; Fitzmaurice, G M, Laird, N M and Ware, J H (2004) Applied Longitudinal Analysis, Wiley, New Jersey; Duncan, S C, Duncan, T E and Hops, H (1996) Analysis of longitudinal data within accelerated longitudinal designs, Psychological Methods, 1, 236-248.
112
Guire, K E and Kowalski, C J (1979) Mathematical description and representation of developmental change functions on the intra- and inter-individual levels, in Nesselroade, J R and Baltes, P B (eds) Longitudinal Research in the Study of Behavior and Development, New York: Academic Press, pp. 89-110.
The accelerated longitudinal design of the Madeira Growth Study
The Madeira Growth Study is an accelerated longitudinal design (Table 11) in which five overlapping age cohorts (participants of starting school grade of 8, 10, 12, 14, and 16 years) were observed concurrently for 3 years, thus providing information spanning 10 years of development from a study lasting just three. A stratified sampling procedure was used to ensure the representativeness of the subjects. At the first stage, 29 state-run schools were selected, taking into account the geographical area, the school grade and sport facilities. At the second stage, a total of 507 students were recruited according to the proportion of the population by age and sex in each of Madeira's eleven districts. Initially, complete records were obtained for 498 children, so the dropout is only 9 children, or less than 2 per cent. All of these had completed only the first aerobic test and had not answered the social questions. These children were simply excluded from the study. The MGS is naturally hierarchical, with measurements on up to 3 occasions at level 1 on 498 children at level 2.
Specifying and estimating cohort effects
Differential cohort effects can be handled by a three-level model in which occasions are nested in individuals who are in turn seen as nested in birth cohorts. The combined model in its random-intercepts form is

$$y_{ijk} = \beta_0 x_{0ijk} + \beta_1 x_{1ijk} + \beta_2 x_{2jk} + \beta_3 x_{3k} + (v_{0k} + u_{0jk} + e_{0ijk}) \qquad (1)$$

where the dependent variable $y_{ijk}$ is the distance covered on occasion i by child j of cohort k. There are three predictors: Age, which is time-varying at the occasion level ($x_{1ijk}$); Gender, which is a child-level variable ($x_{2jk}$); and the Cohort number ($x_{3k}$), centred around 1984 so that 1980 is -2, 1982 is -1, 1984 is 0, 1986 is 1, and 1988 is 2. Once again the $\beta$'s are the averages: $\beta_0$ is now the mean distance achieved by a boy of average age born in 1984, and $\beta_3$ is the linear change in distance achieved for a cohort that is two years later. The random part has three elements: $v_{0k}$, which is the unexplained differential achievement for a cohort around the linear trend; $u_{0jk}$, which is the unexplained differential achievement for a child given their age, gender and cohort; and $e_{0ijk}$, which is the unexplained occasion-specific differential given the child's age, gender, cohort and differential performance. The distributional assumptions complete the model:

$$v_{0k} \sim N(0, \sigma_{v0}^2); \qquad u_{0jk} \sim N(0, \sigma_{u0}^2); \qquad e_{0ijk} \sim N(0, \sigma_{e0}^2) \qquad (2)$$

These terms give the residual variance between cohorts, between children and between occasions. If we want to see how sub-groups of children have changed differentially through time, we can include terms in the fixed part of the model to represent different groupings of children and then form a cross-level interaction with the linear cohort variable.
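A simulation sketch of model (1) may help fix the structure; the parameter values below are illustrative only (not the estimates reported later), and the baseline ages follow the accelerated design of Table 11.

import numpy as np

rng = np.random.default_rng(0)
n_cohort, n_child, n_occ = 5, 100, 3                    # 5 cohorts, 100 children each
cohort = np.repeat(np.arange(-2, 3), n_child * n_occ)   # cohort number centred on 1984
child = np.repeat(np.arange(n_cohort * n_child), n_occ)
girl = rng.integers(0, 2, n_cohort * n_child)[child]
# accelerated design: baseline ages 16,14,12,10,8 for cohorts -2..2, 3 waves each
age = 12 - 2 * cohort + np.tile(np.arange(n_occ), n_cohort * n_child)
v = rng.normal(0, 0.6, n_cohort)[cohort + 2]            # cohort random intercepts v_0k
u = rng.normal(0, 2.2, n_cohort * n_child)[child]       # child random intercepts u_0jk
e = rng.normal(0, 2.2, n_cohort * n_child * n_occ)      # occasion residuals e_0ijk
y = 21.5 + 0.8 * (age - 13) - 2.8 * girl - 0.4 * cohort + v + u + e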
The conventional wisdom is that when there are only a small number of higher-level units, they are more appropriately specified as fixed-effects dummies. However, this is contradicted by the comparative study of Yang and Land (2008), which assessed the extent to which the values of an Age-Period-Cohort model could be recovered.113 They found unequivocally that random effects achieved better results even when the number of units was small and comparable to the present study (5 cohorts). The Yang and Land study was based on restricted maximum likelihood (RIGLS) estimation, which gives empirical Bayes estimates of the random effects. These estimates can be improved upon by adopting a Bayesian approach using Markov chain Monte Carlo procedures (Browne and Draper, 2006).114 There are two aspects to this. First, the full Bayesian analysis better accounts for uncertainty, in that inference about every parameter fully takes into account the uncertainty associated with all other parameters. Second, when there are few higher-level units, the sampling or posterior distribution of the variance of the random effects is likely to be positively skewed, as negative estimates of the variance are not possible. Both these problems are at their worst when there are a small number of units at a level and there is substantial imbalance. While the latter is not the case in the present design, the former certainly is. This has been recognized in the APC literature, and Yang (2006) evaluated the performance of REML empirical Bayes estimates compared to fully Bayesian ones, finding that the latter provide a more secure base for inference.115

The Bayesian approach brings its own difficulties, notably ensuring that the estimates have converged to a distribution, and the required prior distributions have to be specified before estimation. Common practice is now to specify diffuse priors that impart little influence to the estimates and allow the observed data to be the overwhelming determinant of the results. Yang (2006) experimented with a number of alternative diffuse priors and found that the APC estimates were fortunately largely insensitive to the particular choice that was made.
In practice we are going to estimate the models with the MLwiN software, which provides REML and Bayesian estimates using MCMC estimation with default diffuse priors. The Bayesian approach is highly computer-intensive and we will use the REML estimates as good starting points for the MCMC simulation. We ran the MCMC procedure for a burn-in of 500 simulations to 'get away' from the REML estimates and then for a subsequent initial 50000 draws. At the end of this monitoring period each and every estimate was checked for convergence, which is characterised by 'white noise'. The existence of a trend would mean that the sampler has not reached its equilibrium position and a longer burn-in would be required. The monitored estimates were also assessed for information content and further monitoring simulations were undertaken until the effective sample size of the Markov draws was equivalent to 500 independent draws. Once sufficient draws had been made, the estimates were summarised in terms of their mean and 2.5% lowest and highest values to give the Bayesian 95% credible intervals. A sensitivity analysis was undertaken using a number of diffuse priors but this made little difference.
113
Yang, Y and Land, K C (2008) Age-period-cohort analysis of repeated cross-section surveys: fixed or random effects? Sociological Methods and Research, 36, 297-326.
114
Browne, W J and Draper, D (2006) A comparison of Bayesian and likelihood-based methods for fitting multilevel models, Bayesian Analysis, 1, 473-550.
115
Yang, Y (2006) Bayesian inference for hierarchical age-period-cohort models of repeated cross-section survey data, Sociological Methodology, 36, 39-74.
The overall approach to model development is based on fitting models of increasing complexity and assessing whether the more parsimonious form should be retained. As the models are estimated by Bayesian procedures, we have used the Deviance Information Criterion (Spiegelhalter et al, 2002).116 This is a goodness-of-fit measure penalized for model complexity. As such it is a generalization of the Akaike Information Criterion but, unlike that measure, the degree of complexity is estimated during the fitting process. Lower values of the DIC suggest a 'better', more parsimonious model. Any reduction in the DIC is an improvement but, following experience with the AIC, differences greater than 4 suggest that the model with the higher DIC has considerably less support.
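For reference, Spiegelhalter et al (2002) define the effective number of parameters and the DIC as

$$p_D = \bar{D} - D(\bar\theta), \qquad \text{DIC} = \bar{D} + p_D = D(\bar\theta) + 2p_D,$$

where $\bar{D}$ is the posterior mean of the deviance over the monitored draws and $D(\bar\theta)$ is the deviance evaluated at the posterior means of the parameters.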
Table 12 Results for MGS: age and cohort

Terms                       One     SE     Two      SE    Three    SE     Four     SE    CI(2.5%) CI(97.5%)   ESS
Fixed Part
 Cons                     21.509  0.199  21.574  0.456   21.570  0.339  21.115  0.234    20.658   21.579   14973
 Age-13                    0.843  0.050   0.769  0.077    0.699  0.077  -2.695  1.257    -5.154   -0.239   12127
 (Age-13)^2                0.003  0.014  -0.010  0.016   -0.007  0.016  -0.095  0.040    -0.172   -0.018   20669
 Girl                     -2.792  0.287  -2.794  0.282   -2.790  0.282  -2.353  0.339    -3.019   -1.692   14557
 (Age-13)*Girl            -0.513  0.070  -0.517  0.071   -0.518  0.071   3.525  1.802    -0.019    7.062   12038
 (Age-13)^2*Girl          -0.030  0.020  -0.030  0.020   -0.030  0.020   0.084  0.056    -0.027    0.194   20290
 Cohort-2                                                -0.384  0.236  -0.758  0.225    -1.201   -0.318   40684
 (Cohort-2)*Girl                                                         0.684  0.320     0.058    1.311   40873
 (Cohort-2)*(Age-13)                                                    -0.294  0.112    -0.514   -0.076   12012
 (Cohort-2)*Girl*(Age-13)                                                0.342  0.161     0.026    0.657   11926
Random Part
 Between cohorts                          0.669  2.066    0.353  1.032   0.193  0.569     0.032    1.120    1149
 Between children          4.804  0.424   4.791  4.010    4.738  0.412   4.687  0.411     3.931    5.538   17636
 Between occasions         4.795  0.218   4.791  4.381    4.772  0.218   4.777  0.216     4.373    5.222   23196
DIC:                     6908.44         6902.6          6902.0         6898.4
Diff in DIC                                5.84            6.44          10.08

(The CI and ESS columns refer to Model Four.)
Modelling age, sex and cohort
This section reports the results of a series of models involving age, sex and cohort. The results are shown in Table 12. Model 1 is a two-level growth model where the value of the run test is modelled as a function of quadratic chronological age and there is an interaction with gender. As it is a two-level model, the data have effectively been pooled across cohorts. The shape of the growth curve is most readily appreciated in Figure 12, which shows the curves with 95% credible intervals. Clearly, at younger ages there is little difference between the sexes and, while both boys and girls develop with age, the rate of change is greater for boys. The curve is convex for girls and there is a flattening of growth by the age of 18. The general shape of the trajectory is consistent with past research on the association between age and performance, as are the differentials by sex.

The random part of this model consists of a between-child random-intercept variance and a within-child between-occasion variance. The residual correlation between occasions in the compound symmetry model is equal to 0.50, so there is quite a lot of dependency across time; children with a relatively high performance on one occasion tend to have a relatively high performance on other occasions. In fact this is the most complicated two-level model that is supported by the data when DIC values are compared. There is no evidence of a cubic term, nor that a random-slopes model is needed. Children differ in their aerobic performance but there is no evidence that this variance increases or decreases with age. Moreover, a model with the more complex unstructured covariance between children was not an improvement on this compound symmetry model. Consequently, there is no evidence that the residual dependency over time is changing with occasion.
116
Spiegelhalter, D J, Best, N G, Carlin, B P and van der Linde, A (2002) Bayesian measures of model complexity and fit, Journal of the Royal Statistical Society, Series B, 64, 583-640.
The rest of the estimates in Table 12 are for a three-level model where random intercepts are additionally included at the cohort level in the manner of equation (1). Model 2 simply includes the random intercepts for cohorts, Model 3 additionally includes a linear cohort term, while Model 4 includes the full three-way interaction of linear age by sex by cohort. This latter model allows for the possibility that the cohort effects are differential for boys and girls and for different ages. Comparing the DIC of each of the three-level models to that of the two-level model shows that the more complex models are an improvement over a model with no cohort effects. Moreover, the model with the lowest DIC is Model 4, where the cohort effects are differentiated by age and sex. The table also gives the 95% credible intervals for this model and the effective sample sizes (an MCMC simulation of 50k draws was used). It is noticeable that the 95% credible intervals for each of the terms involving the cohort do not include zero, suggesting that each term is required. Even with less than a decade separating the earliest and latest birth cohorts there is evidence of a difference.
Model 4, with its complex interaction terms, is most readily appreciated graphically by plotting the estimated growth curves for each cohort separately for boys and girls. The plots for girls show convergence and there is no evidence whatsoever of cohort effects, as each segment of the growth curve overlaps the next. However, for boys there is evidence of cohort differences and these are bigger for the earlier cohorts. The nature of the change can be seen by plotting the predicted mean performance and 95% intervals for boys and girls aged 14 using MLwiN's customised predictions facility. The decline in aerobic performance with later cohorts for the boys is marked and this contrasts with the lack of change for girls.
Using an appropriate multilevel analysis we have been able to ascertain the intra-cohort development in aerobic performance and evaluate inter-cohort changes. The research design facilitated simultaneous estimation of age effects, cohort variations, and age-by-cohort interaction effects on aerobic performance. Both sexes show increased capacity as they develop and this is most marked for boys, so that on average there are substantial differences between boys and girls at older ages. There is quite strong consistency between measures for the same child on different occasions, so that children with high or low capacity appear to maintain this over time. Moreover, the variance around the average growth does not change with age, so there is no evidence of convergence or divergence in aerobic capacity across the years of childhood and adolescence.

There are, however, marked differences between cohorts, so that boys born less than a decade apart have noticeably poorer aerobic performance. Girls, in contrast, have not experienced this decline and have maintained their generally lower level. For boys the decline is most noticeable in the older age groups. This decline occurred to cohorts that experienced rapid social change, and we can speculate that it was related to a move away from the land, more sedentary forms of living and much greater use of motorized forms of transport.
Fitting a two-level model: the Mundlak formulation
The three-level model we have just fitted is the most appropriate model for examining the nature of cohort change. But it is interesting, in the light of the previous chapter, to fit a two-level model without and with the group means – the Mundlak formulation in its contextual form. The results are shown in Table 13. The first model includes a quadratic polynomial of age for both Boys and Girls, and the second model additionally includes Child Average Age (grand-mean centred to allow an interpretable intercept) and, in the light of the modelling above, an interaction between the Girl dummy and Child Average Age. The models were estimated in MCMC with 50000 monitoring simulations after a burn-in of 500.
Table 13 Comparing models with and without group means for Age

                       Base              +AvAge*Girl
                     Est.     S.E.     Est.     S.E.
Fixed Part
 Cons               21.512   0.199    21.434   0.200
 (Age-13)^1          0.846   0.051     0.602   0.098
 (Age-13)^2          0.003   0.014     0.003   0.014
 Girl               -2.796   0.284    -2.721   0.284
 (Age-13)^1.Girl    -0.516   0.071    -0.287   0.140
 (Age-13)^2.Girl    -0.030   0.020    -0.030   0.020
 (AveAge-gm)                           0.326   0.113
 Girl.(AveAge-gm)                     -0.307   0.161
Random Part
 Children            4.793   0.421     4.777   4.014
 Occasion            4.795   0.216     4.790   4.382
DIC:              6907.749          6902.616
There are a number of points to note:
• The model with the two extra parameters is an improvement in the DIC of around 5, so there is evidence of a contextual effect; a Hausman test would show significance.
• The contextual effect for boys is positive, so as Child Mean Age goes up (in effect, a cohort one year older at baseline) endurance goes up by 0.326. If we mix types of inference and conduct a Wald test on this parameter, it is highly significant (the chi-square is 8.326 on 1 df; p = 0.004, given by cpro 8.326 1). For boys their baseline age, the cohort into which they were born, matters.
• The comparable differential slope for girls is -0.307, so baseline age matters much less for them, and we can use a Wald test to see if the cross-sectional slope for Girls is different from zero (note the two 1's in the test window to specify the full slope effect for Girls). The slope for girls is only 0.019 (0.326 - 0.307) and is far from significant. There is no evidence of a contextual effect for Girls.
• The linear effect of Age for Boys reduces from 0.846 to 0.602 and the linear differential for Girl attenuates from -0.516 to -0.287 as the group means are included; not including the means would be to commit omitted-variable bias. We can use the customised predictions to make a plot of the predicted endurance for Boys and Girls from the first and second models.

Figure 14 The effect of individual Age on endurance: estimated with and without Child Mean Age

There is no difference whatsoever for girls, but the longitudinal relationship with Age for Boys in the model without Child Average Age is too steep; it has been biased upwards.
We can make sense of these results by correlating Child Average Age with Cohort: the value is minus 0.9948; they are measuring the same thing – children with an older average age in the accelerated design were born into an earlier cohort. Moreover, if we fit a variance components model with child Age as the outcome, we see that the cross-sectional, between-children variance in Age is much larger than the between-occasions variance. This is a consequence of the accelerated design: we have only followed individuals for three years but have selected children aged 8 to 16 at baseline to follow. Thus it is necessary for a predictor variable to have a between-children, time-invariant element to estimate cross-sectional effects, but it is not sufficient, as we did not find a cross-sectional effect for Girls.
Importantly, the Mundlak method allows us to 'model out' the contextual cohort effect and distinguish it from individual longitudinal growth. The model with Child Mean Age included was used to make the set of predictions shown in Figure 15; the left-hand side is the predicted longitudinal within-child growth with Mean Age held steady at 13, while the right-hand side shows the effect of mean change with individual Age held steady, also at 13. The vertical and horizontal axes are set to the same scale on both graphs.

Figure 15 The longitudinal and cohort effects of Age in predicting Endurance

The accelerated design has done its job when allied to this type of modelling. We have gathered information efficiently in a short period, we have identified that there is an element of cohort change for Boys, such that the younger-aged cohorts have less endurance, and we have now also characterised individual growth across the age span. Compare this with the fixed-effects approach of the previous chapter, where we would not have been able even to examine Boy-Girl differences as these are time-invariant! A significant Hausman test does not simply mean we have an endogeneity problem, but gives an opportunity to explore the different processes that are at work.
Changing gender ideology in the UK117
This study uses data from the British Household Panel Survey, a nationally representative sample of some 5500 households drawn at the start of the survey in 1991, giving close to 10,000 individual respondents. Subsequently, individuals have been traced and re-interviewed each year, generating annual panel data. The BHPS includes an extensive range of questions and every two years a section on beliefs, values and attitudes is incorporated. This contains 8 items which can be used to measure an individual's gender ideology. These items, given in the table below, consist of statements such as "A pre-school child is likely to suffer if his or her mother works", to which a respondent may answer "strongly agree", "agree", "neither agree nor disagree", "disagree", or "strongly disagree".
Gender role attitude items in the BHPS
1  A pre-school child is likely to suffer if his or her mother works
2  All in all, family life suffers when the woman has a full time job
3  A woman and her family would all be happier if she goes out to work
4  Both the husband and wife should contribute to the household income
5  Having a full-time job is the best way for a woman to be an independent person
6  A husband's job is to earn money; a wife's job is to look after the home and family
7  Children need a father to be as closely involved in their upbringing as the mother
8  Employers should make special arrangements to help mothers combine jobs and childcare
In this study, the 5-point Likert scale on which responses are measured is recoded to a 3-point scale and the coding for items 1, 2 and 6 is inverted for consistency. The response alternatives used are "Traditional", "Neutral" or "Egalitarian", with high values denoting an egalitarian response. A multilevel item response model118 of the eight items found that the items gauge gender role attitudes in very different ways and cannot usefully be combined into a single measure. The most discriminating of these items (in the sense of being most strongly related to the underlying latent ideology score) is "All in all, family life suffers when the woman has a full time job", suggesting that it makes the best single measure. This response was then used in a multilevel model to assess a number of questions about change. The BHPS is naturally a multiple-cohort study as, at the start of the survey, individuals were aged from 18 to 90 plus. We will also include area effects as we are interested in the geography of the outcome.
117
This section reports work undertaken with Laura Steele.
118
Adams, R J, Wilson, M and Wu, M (1997) Multilevel item response models: an approach to errors in variables regression, Journal of Educational and Behavioral Statistics, 22, 47-76.
The model is built in several stages:
1. an un-ordered multinomial logit to deal with the 3 unordered categories of the response; this is treated as a special case of the multivariate model;
2. repeated measures to deal with responses on different occasions; this allows for dependence over time and the correct modelling of the age or developmental effect;
3. random cohort effects to model the differences between birth cohorts;
4. a cross-classified model, as respondents can relocate, so that individuals can be seen as belonging to different local authority areas at different times;
5. steps 1 to 4 form the base or empty model which effectively models the individual, temporal and spatial variation; subsequently, fixed effects for age, cohort and time-varying and time-invariant variables for individuals and their neighbourhoods are included as main effects and interactions to account for this variation.
Building the base model
Stage 1: Unordered multinomial model
The unordered single-level multinomial model can be written succinctly as follows:

$$\log\left[\frac{\pi_i^{(s)}}{\pi_i^{(t)}}\right] = \beta_0^{(s)} + \beta_1^{(s)} x_i, \qquad E\left(y_i^{(s)}\right) = \pi_i^{(s)}, \qquad s = 1, \dots, t-1 \qquad (3)$$

where i refers to an individual, the response has t categories, $\pi_i^{(s)}$ is the underlying probability of being in category s, $x_i$ is a predictor variable and the $\beta$'s are regression-like parameters linking the response to the predictors. The Expectation operator is used to signify ‘on average’ as we have no stochastic element in this model. One of the categories, signified by t, is taken as a reference category and this plays the same function as 0 in a binary outcome model. In effect there is then a set of t-1 equations where, if a logit link is chosen, the log-odds of each of the other categories in comparison to the base category is regressed on predictor variables. With t equal to 3 categories as here, there are 2 equations:
$$\log\left[\frac{\pi_i^{(1)}}{\pi_i^{(3)}}\right] = \beta_0^{(1)} + \beta_1^{(1)} x_i \qquad (4)$$

$$\log\left[\frac{\pi_i^{(2)}}{\pi_i^{(3)}}\right] = \beta_0^{(2)} + \beta_1^{(2)} x_i \qquad (5)$$
Typically separate intercepts and slopes and the same predictor variables are included for
each line of the specification. A logit formulation is used to constrain predicted probabilities
between 0 and 1.
Each slope parameter is interpreted as an additive effect of a one-unit increase in the associated predictor variable on the log-odds of being in category s in comparison to the referent category t. It is convenient, and recommended, that the model is interpreted by converting the estimated logits to probabilities, as the logits can mislead even about the sign of the relation.119 The predicted values for the non-referent categories are given as follows:
$$\pi_i^{(s)} = \frac{\exp\left(\beta_0^{(s)} + \beta_1^{(s)} x_i\right)}{1 + \sum_{k=1}^{t-1} \exp\left(\beta_0^{(k)} + \beta_1^{(k)} x_i\right)} \qquad (6)$$
So it is clear that all the equations have to be involved in the transformation to probabilities; it is not just the equation involving a single category, and that is why the logits can be uninformative. The probability of being in the base category is the remainder from 1 minus the summed probabilities of all other categories:
$$\pi_i^{(t)} = 1 - \sum_{s=1}^{t-1} \pi_i^{(s)} \qquad (7)$$
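To make the transformation concrete, here is a minimal Python sketch (not MLwiN syntax) of equations (6) and (7) for the three-category case; the two intercepts and slopes are invented values purely for illustration.

import numpy as np

# Intercepts and slopes for the two non-referent categories (invented values):
# row 0 is Neutral vs Traditional, row 1 is Egalitarian vs Traditional.
beta = np.array([[-0.30, 0.05],
                 [-0.15, 0.08]])

def multinomial_probs(x):
    """Return (p_trad, p_neutral, p_egal) for a predictor value x."""
    logits = beta[:, 0] + beta[:, 1] * x      # one logit per non-referent category
    expl = np.exp(logits)
    p_nonref = expl / (1.0 + expl.sum())      # equation (6): all equations enter the denominator
    p_ref = 1.0 - p_nonref.sum()              # equation (7): base category is the remainder
    return p_ref, p_nonref[0], p_nonref[1]

print(multinomial_probs(0.0))  # the three probabilities sum to 1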
MLwiN uses the customised prediction facility to transform from log-odds to probabilities,
and in random effects models these can be subject-specific or population average values.
For these models with multiple nominal outcomes it is natural to use a multinomial residual distribution:

$$y_i^{(s)} \sim \text{Multinomial}\left(n_i, \pi_i^{(s)}\right)$$

$$\text{cov}\left(y_i^{(s)}, y_i^{(r)}\right) = \begin{cases} \pi_i^{(s)}\left(1 - \pi_i^{(s)}\right)/n_i & \text{when } s = r \\ -\,\pi_i^{(s)} \pi_i^{(r)}/n_i & \text{when } s \neq r \end{cases} \qquad (8)$$
where $y_i$ is the vector of responses: if these are 1/0 responses the denominator $n_i$ is equal to 1, or if the response is an observed proportion in the category, $n_i$ is the denominator of the proportion. This is achieved in MLwiN by regarding the set of outcomes at level 1 as nested within individuals at level 2 and, in a similar fashion to the binomial logit, by calculating a multinomial weight based on the predicted probability in each category and constraining the variances of these weights to 1. This is an exact multinomial residual distribution; over-dispersion may be permitted when modelling proportions.
119 Retherford, R D and Choe, M K (1993) Statistical Models for Causal Analysis, John Wiley & Sons, New York.
The modelling begins by specifying that the response is a categorical variable with 3
outcomes and that there is a two-level structure in which level 1 is the choice category and
level 2 is a unique identifier for each person at each wave. We choose the base category to
be a Traditional viewpoint on gender ideology and therefore model the log-odds of Neutral
in comparison to Traditional, and Egalitarian in comparison to Traditional. The fixed part of
the model consists of two parameters associated with the Constant, and these give the
average log-odds across all waves and across all individuals. This is a multivariate model so
that there is no level 1 variance, while the level 2 variance is an exact multinomial
distribution in which the variance depends on the mean probability of being in each
category.
The 54 thousand observations, representing some 9 thousand individuals measured over 9 waves (1991, 1993, 1995, …, 2007), double to 108 thousand observations when account is taken of the two non-referent responses.
The model was estimated initially by IGLS and then by MCMC estimation with an initial
burn-in of 500 simulations followed by 5k simulations. The estimates are as follows.
When the customised predictions facility is used to transform the values back to a
probability scale we get the following results.
            Median (subject-specific)           Mean (population average)
Category    Probability   Lower    Upper        Probability   Lower    Upper
Trad        0.383         0.380    0.388        0.383         0.380    0.388
Neutral     0.290         0.286    0.294        0.290         0.286    0.294
Egal        0.326         0.322    0.330        0.326         0.322    0.330
As there are no higher-level random effects, the median and the mean (the subject-specific and population average results, Chapter 14) give the same estimates. These can be compared to a simple tabulation with percentages:
        Trad      Ntral     Egal      TOTALS
N       20779     15720     17635     54134
%       38.4      29.0      32.6      100.0
So that the most common declared category is the Traditional (by a small margin) while the
least common is the Neutral standpoint.
Stage 2: a model with hierarchical structure for waves nested in individuals
The second model sees repeated measures as being nested within individuals. The general form of the model is

$$\log\left[\frac{\pi_{ij}^{(s)}}{\pi_{ij}^{(t)}}\right] = \beta_0^{(s)} + u_j^{(s)} \qquad (9)$$

where i now indexes the measurement occasion and j the individual, and $u_j^{(s)}$ is an individual-level random effect for each contrasted non-referent category; these effects are assumed (on the logit scale) to have a mean of zero and a variance of $\sigma^2_{u(s)}$. The individual random effects may be correlated through a covariance term. The two variance terms here represent the between-individual variance in the Egalitarian:Traditional and the Neutral:Traditional log-odds ratios. There are of course missing observations and we are invoking the MAR assumption (Chapter 14), so that the response on gender ideology is presumed not to affect the missingness, and therefore the missingness mechanism does not require explicit modelling.
In practice in MLwiN, this is modelled as a three-level hierarchical structure with the
choice set at level 1 nested within person-wave at level 2 nested within individual at level 3.
Again the fixed part of the model is kept simple with just two constant terms to specify the
average but additionally we now have variance terms for individuals for each of the
contrasted response outcomes.
The monitoring phase of the MCMC simulation was increased to 15k. On completion of the
monitoring period the following results were obtained.
The between-individual variances on the logit scale are very large, particularly for the Egalitarian response. This means that in this unconditional model there is a great deal of similarity within a person across waves; individuals are not changing their category but are staying with their preference, and this is particularly the case for the Egalitarian response.
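One rough way to convey how much similarity such large variances imply is the latent-variable VPC calculation of Chapter 12, which fixes the level-1 variance at $\pi^2/3$ on the logit scale. The Python fragment below is a sketch with an illustrative variance value, not the estimates from this model.

import math

# Latent-variable VPC for a logit model (see Chapter 12): the level-1 variance
# is fixed at pi^2/3 ~ 3.29 on the underlying logistic scale.
sigma2_u = 6.0  # illustrative between-individual variance on the logit scale
vpc = sigma2_u / (sigma2_u + math.pi ** 2 / 3)
print(round(vpc, 2))  # ~0.65: most of the latent variation lies between individuals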
We can again use the customised predictions window, and we will choose both subject-specific medians and population average means with their 95% confidence intervals. The median gives the median probability for an individual across the 9 waves; the category with the highest value remains the Traditional one, while the Egalitarian is quite a bit lower. Reasonably similar values are found for the mean probability despite the size of the random effects. Notice too that both sets of confidence intervals have widened in comparison to the previous model, as we are now inferring not to individuals in general but to the ‘average’ individual across waves.
            Median probability and 95% CIs       Mean probability 95% CIs
Category    Median    Low      High              Low      High
Trad        0.448     0.432    0.465             0.415    0.432
Neutral     0.409     0.395    0.422             0.281    0.298
Egal        0.143     0.129    0.159             0.276    0.298
Stage 3: a model with random cohort effects
The next model additionally includes random effects for cohorts which are defined by the
calendar year in which the person was born. There are 78 birth cohorts running from 1894
to 1973. Again we keep the fixed part of the model simple with just the two intercepts for the non-referent categories, but include variance-covariances at the cohort level.
We increased the monitoring phase of the MCMC estimates to 50k and here are the results.
Clearly there are sizeable variations between cohorts, particularly for the Egalitarian category, and the between-individual variances also remain very large. The estimated cohort residuals show a clear outcome.
[Figure: estimated cohort residuals, differential logit plotted against birth year (1890-1980), for the Egalitarian and Neutral categories]
There has been a marked and consistent rise in the Egalitarian category with successive birth cohorts (the random effects, of course, have no knowledge of earlier or later), while the rise in the Neutral category has also been consistent if less marked. This must mean that the Traditionalists have declined.
Stage 4: a cross-classified model for area effects
The next model includes a random effect for nearly 400 local authority areas. As individuals can be expected to re-locate across the sixteen years of this study, we now require a cross-classified model. Again we keep just two constants in the fixed part but additionally include two random effects for the non-referent categories at the LAD level.
The model now has the following five classifications.
After 50k MCMC monitoring simulations, the following results were obtained.
In this unconditional model, although they are much smaller than the between-individual and between-cohort random effects, there are now quite large area effects at this macro scale of groups of LADs.120 We can use Larsen’s MOR procedure to get some handle on these: the MOR is 2.1 for the Neutral category and 3.1 for the Egalitarian. Geography appears to make quite sizeable differences to attitudes to gender ideology. The plot of the residual differential logits shows that there is a quite strong positive correlation between Neutral and Egalitarian; the correlation between these latent values, obtained from the estimates table, is 0.88.
120 The LADs in the BHPS are not actual local authority districts but groups of such districts, which were combined if their population fell below 120,000 in 1991 (for reasons of preventing disclosure).
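The MOR is calculated from an area-level variance on the logit scale as $\exp\left(\sqrt{2\sigma^2}\,\Phi^{-1}(0.75)\right)$ (see Chapter 12). As a check on the figures quoted above, the following Python sketch reproduces MORs of about 2.1 and 3.1 from variances of roughly 0.6 and 1.4; these variance values are back-calculated for illustration, not read directly from the model output.

from math import exp, sqrt
from statistics import NormalDist

def mor(sigma2):
    """Larsen's median odds ratio from an area-level variance on the logit scale."""
    return exp(sqrt(2 * sigma2) * NormalDist().inv_cdf(0.75))

# Variances of about 0.6 and 1.4 on the logit scale give the MORs quoted in the text.
print(round(mor(0.60), 1))  # 2.1 (Neutral)
print(round(mor(1.40), 1))  # 3.1 (Egalitarian)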
Including longitudinal and cohort effects
Having estimated this base model with its temporal and spatial effects we are now going to
include terms in the fixed part of the model that try to account for these variations. We
begin with longitudinal and cross-sectional effects. Due to the linear dependency between Cohort, Age and Period (here represented by Wave) we can only include two of these elements in the model, and clearly we have to be careful not to misinterpret one for the other. Cohort effects are cross-sectional terms and these could be included in the model in a number of equivalent ways, such as time-invariant Age measured in 1991 at the start of the survey, the group mean age of the respondents, or the year of birth of the respondent. We have chosen the latter as it gives a simple metric by which to portray change, but the choice is solely one of convenience. This variable is entered into the model at first as a linear term centred on its grand mean of 1948. To specify the longitudinal effect we could have chosen the time-varying variable Age, but of course we have not observed the maturing effect over the adult lifespan, as the panel survey is limited by the years of measurement to 1991 to 2007. We have therefore chosen to include the year of the survey as the longitudinal effect, centred on 1998, initially as a linear term. This model is equivalent to the contextual Mundlak formulation of the previous chapter.
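The linear dependency is exact: age equals period minus cohort, so a design matrix containing all three terms is rank-deficient. A tiny Python demonstration, with invented years:

import numpy as np

# period = cohort + age exactly, so the three effects cannot all be identified.
cohort = np.array([1930.0, 1948.0, 1960.0, 1973.0])  # illustrative birth years
period = np.array([1991.0, 1999.0, 2007.0, 2003.0])  # illustrative survey years
age = period - cohort

X = np.column_stack([np.ones(4), age, period, cohort])
print(np.linalg.matrix_rank(X))  # 3, not 4: one column is redundant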
The initial model is specified as follows:
After 50k monitoring simulations the estimates are as follows.
All four slopes represent the change in the logit of the outcome for one year and in that
sense are directly comparable. There are some interesting patterns. The Egalitarian
outcome has a positive slope associated with Date of Birth that is larger in absolute value
than the negative slope for Year. For the Neutral outcome both slopes are positive. It is
easiest to appreciate the scale and nature of the effects by characterising the change for
some stereotypical people. Here we will choose the 10th, 25th, 50th, 75th and 90th percentiles of the
Birth Cohorts and plot them against year, distinguishing the propensity of all three
outcomes. We can do this through the customised predictions window.
The probability of choosing the Egalitarian outcome is most marked by Birth Cohort, with older cohorts having a much lower chance of agreeing with this position. With time passing, however, all the cohorts show a decline in their probability of agreeing with the Egalitarian stance. The proportion agreeing with the Traditionalist position similarly shows strong Birth Cohort effects and also a decline over time in support for this position. The probability of choosing the Neutral stance is much less differentiated by when people were born, and this position has seen an increase in support over time.
An exactly equivalent way of looking at this is to subtract the specified birth year from the calendar year of the survey to derive the time-varying age of the respondent. We can then plot the predicted probabilities against changing age; they may look very different but they are the same model results just portrayed in another way. Unlike the aerobic performance example, it is quite plausible that events associated with period may have affected the response as well as the processes of ageing per se.
We then fitted more complex quadratic and cubic models for Year and Birth finding that the
DIC improved substantially with the quadratic compared to the linear but worsened with
the cubic. There was also no evidence of an interaction effect between Year of Birth and
Year. The predicted probabilities for the quadratic model show essentially the same results
as the linear model.
Including Gender as a main effect and as interactions
The next model includes a time-invariant predictor for Sex, both as a main effect and in a quadratic interaction with Year.
The basic patterns are the same, with a large cohort effect for both extreme categories, but females in the earlier cohorts show less traditional views.
Age, period and cohorts?
It has long been held that while it is conceptually possible to separate age, period and cohort, it is not technically possible to do so due to the linear dependence between the three terms: the identification problem. This has recently been challenged in a set of papers by Yang and Land.121 Their solution has two parts. First, micro-survey data such as the BHPS are used to form bespoke groupings such that, while period remains on an annual basis, cohorts consist of data for a five-year period; this breaks the linear dependence between the terms. Second, they use random-coefficient modelling to analyse a complex cross-classification of individual respondents (level 1) nested within cells created by a cross-classification of birth cohorts and time periods. They are in fact using a main-effects cross-classification of random terms for some elements of the APC model and a fixed term for others. Thus we could fit a model with a quadratic term for Birth year in the fixed part of the model and random effects for each Wave alongside random effects for (say) five-year age groups.
The problem with this approach is that it is unclear which term should be put into the fixed part, and what is a meaningful grouping of age, period or cohorts to make for identifiability.
Glenn (1989) has been highly critical of what he calls ‘mechanical approaches’:122
The reason why no mechanical or automatic method of cohort analysis can always
properly sort out age, period, and cohort effects is simple: when the effects are
linear, several different combinations of the different kinds of effects can produce the
same pattern of variation in the data. In other words, different combinations of age,
period, and cohort influences can have identical empirical consequences. Therefore, a
proper labelling of the effects requires a priori knowledge of the nature of the effects
or some kind of information from outside the cohort table.
If different Age, Period and Cohort processes can yield identical observable outcomes, it must be that the observed data, of itself, cannot tell us which combination is at work; it is not possible to make bricks without straw. Indeed, despite considerable experimentation with different terms in the fixed part and different alternative groupings of age and birth
121 Yang, Y and Land, K C (2006) A mixed models approach to the age-period-cohort analysis of repeated cross-section surveys, with an application to data on trends in verbal test scores, Sociological Methodology, 36: 75-98; Yang, Y and Land, K C (2008) Age-period-cohort analysis of repeated cross-section surveys: fixed or random effects? Sociological Methods and Research, 36: 297-326; Yang, Y (2006) Bayesian inference for hierarchical age-period-cohort models of repeated cross-section survey data, Sociological Methodology, 36: 39-74; Yang, Y (2007) Is old age depressing? Growth trajectories and cohort variations in late-life depression, Journal of Health and Social Behavior, 48: 16-32; Yang, Y (2008) Social inequalities in happiness in the United States, 1972 to 2004: an age-period-cohort analysis, American Sociological Review, 73: 204-226; Smith, H L (2008) Advances in age-period-cohort analysis, Sociological Methods & Research, 36: 287-296.
122 Glenn, N D (1989) A caution about mechanical solutions to the identification problem in cohort analysis: comment on Sasaki and Suzuki, American Journal of Sociology, 95(3): 754-761; Glenn, N D (1976) Cohort analysts' futile quest: statistical attempts to separate Age, Period, and Cohort effects, American Sociological Review, 41: 900-904.
year, we were unable to fit all three APC terms simultaneously without the MCMC estimates
‘blowing up’; this suggests that identifiability remained a real problem.
What we have learnt
• The multilevel model can estimate differential cross-sectional and longitudinal effects within an overall repeated-measures framework. It is important to do so because the relatively enduring processes may have quite different effects from changeable longitudinal elements.
• The multilevel model can do this for discrete responses, such as the unordered categorical response used in the gender ideology example, and for models with added area effects.
• Separately estimating Age, Period and Cohort elements is technically difficult using grouping and random effects, and this remains a mechanical approach. In the study of the development of aerobic performance we can effectively rule out period effects a priori, but in political voting behaviour we could not discount events like ‘Jennifer’s ear’ or ‘where’s the beef’ alongside the long-standing socialization of being in a post-war cohort, nor maturation processes as people age. Quantitative technique has to be tempered by real-world understanding. The entity being studied makes a real difference, and that would account for the lack of debate about APC in the biostatistics literature, where it is often entirely plausible to discount period effects.
Chapter 16 The analysis of spatial and space-time models
Introduction
The standard multilevel models treat space in a rather rudimentary way, so that individuals at level 1 are seen as nested in neighbourhoods at level 2 and districts at level 3. This forms a hierarchical partitioning of space and we can calculate what percentage of the variance lies
at each level and hence the degree of dependence or autocorrelation at that level. Indeed
this random-effects approach based on a null model was the basis for an early classic paper
on geographical analysis.123 The rudimentary nature of the model can be appreciated from Figure 17. In this standard model each neighbourhood is treated as a separate entity; there is no knowledge in the model of which areas are next to each other. The random-effects
model is based on exchangeability and the results are invariant to the location of areas; we
can move areas about without affecting the results as in Figure 18.124 In contrast to the
multilevel approach, the spatial econometric tradition125 emphasizes that the location of
areas does matter and we can conceive of interactions between areas that must be
accommodated in the model as ‘interaction’ or spillovers between adjacent areas as in
Figure 19. Although this notion is not generally used in the literature, it is helpful to think in terms of wider areas that surround each area, which we will call spatial ‘patches’. The red lines on Figure 20 show the ties of adjacency for a particular neighbourhood, area 10, and consequently they define a wider area or ‘patch’, that is, areas 10, 7 and 11. There are 13 areas on this map, so there are 13 patches that overlap to some extent. The spatial
multilevel model allows exchangeability of information within these pre-defined patches,
and that is how additional spatial dependence is accommodated in the multilevel model.
Although the two traditions of multilevel modelling and spatial modelling have
evolved separately there are now a number of applications in which both approaches are
used simultaneously. These include the following papers by authors who are from the
multilevel side of the house: Verbitsky-Savitz and Raudenbush (2009); Leyland (2001);
Leyland et al (2000); Langford et al (1999); Jackson et al (2006) and Best et al (2005).126
123 Moellering, H and Tobler, W (1972) Geographical variances, Geographical Analysis, 4: 34-50.
124 More formally, exchangeability is that there is no systematic reason to distinguish particular areas. In essence we are assuming that we can permute the labels of the areas without affecting the results. De Finetti in 1930 derived much of the apparatus of modern Bayesian statistics from this assumption.
125 Anselin, L (1988) Spatial Econometrics: Methods and Models, Dordrecht: Kluwer Academic Publishers.
126 Verbitsky-Savitz, N and Raudenbush, S W (2009) Exploiting spatial dependence to improve measurement of neighbourhood social processes, Sociological Methodology, 39(1): 151-183; Leyland, A H (2001) Spatial analysis, in Leyland, A H and Goldstein, H (eds) Multilevel Modelling of Health Statistics, Chichester: John Wiley & Sons, 143-157; Leyland, A, Langford, I H, Rasbash, J and Goldstein, H (2000) Multivariate spatial models for event data, Statistics in Medicine, 19: 2469-2478; Langford, I H, Leyland, A H, Rasbash, J and Goldstein, H (1999) Multilevel modelling of the geographical distributions of diseases, Applied Statistics, 48: 253-268; Jackson, C, Best, N G and Richardson, S (2006) Improving ecological inference using individual-level data, Statistics in Medicine, 25: 2136-2159; Best, N, Richardson, S and Thomson, A (2005) A comparison of Bayesian spatial models for disease mapping, Statistical Methods in Medical Research, 14: 35-59.
Figure 17 Neighbourhood influence in the standard multilevel model (Elffers, 2003)127
Figure 18 Invariance over location in the standard model
Figure 19 Interacting areas in a spatial model based on adjacency
Figure 20 The spatial 'patch' based on area 10
127 Elffers, H (2003) Analysing neighbourhood influence in criminology, Statistica Neerlandica, 57(3): 347-367.
Book-length treatments include Lawson et al (2003) and Banerjee et al (2004).128 However, it must be noted, as we shall see, that not all the models that have been developed in spatial econometrics are readily implementable as multilevel models.
What do we mean by adjacency: defining spatial neighbours
Spatial multilevel models include ‘extra’ spatial autocorrelation or dependence over and
above that from a strict hierarchy. This situation is akin to the standard model giving a
‘compound symmetry’ approach to dependency in repeated-measures time-series analysis, with more elaborate models being used to estimate more complex forms (Chapter 14, this
volume). Spatial analysis is, however, much more demanding than time series. With
repeated measures the current value can only depend on the past but in the analysis of
spatial series the dependency could go in any direction. We tackle this by defining adjacent
neighbours and specifying spatial weights that give the connectivity between places. Thus,
in the South Wales valleys you could specify connectivity up and down the valley but not
across from one valley to another, with weights for these spatial neighbours defined as the
inverse of the road distance between them. The set of spatial patches with their additional
spatial dependency is defined by these spatial neighbours while the weights give the degree
of connectivity between areas.
Figure 21 Three types of join: a) Bishop's case b) Rook's horizontal case c) Queen's case
The form of the spatial neighbours will have a major role in determining the patterns that will be found. Figure 21 shows a chessboard with three types of adjacency structure. A Bishop’s case joins along the diagonals. This would give positive spatial autocorrelation (the usual geographical case) as the Black areas will be joined to the Black and the White to the White. However, a Rook’s case, where the connectivity is either along the rows or the columns, would give negative autocorrelation; Black areas are joined to White ones. A Queen’s case connectivity, where each area is joined to the next area (horizontal, diagonal and vertical joins), would show no dependence, with its White-White, Black-Black and Black-White joins effectively cancelling each other out.
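The chessboard example is easily verified by counting same-colour joins under each adjacency structure; the short Python sketch below does this for an 8 by 8 board (our own illustration, not part of the MLwiN workflow).

import numpy as np

# Join counts on a chessboard colouring: bishop's-case joins link like colours,
# rook's-case joins link opposite colours, the queen's case mixes the two.
n = 8
colour = np.add.outer(np.arange(n), np.arange(n)) % 2  # 0/1 chessboard pattern

def same_colour_share(moves):
    same = total = 0
    for r in range(n):
        for c in range(n):
            for dr, dc in moves:
                rr, cc = r + dr, c + dc
                if 0 <= rr < n and 0 <= cc < n:
                    total += 1
                    same += int(colour[r, c] == colour[rr, cc])
    return same / total

rook = [(0, 1), (0, -1), (1, 0), (-1, 0)]
bishop = [(1, 1), (1, -1), (-1, 1), (-1, -1)]
print(same_colour_share(bishop))         # 1.0: every join is same-colour (positive dependence)
print(same_colour_share(rook))           # 0.0: every join is opposite-colour (negative dependence)
print(same_colour_share(rook + bishop))  # ~0.47: the queen's case nearly cancels out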
128 Lawson, A B, Browne, W J and Vidal Rodeiro, C L (2003) Disease Mapping with WinBUGS and MLwiN, London: Wiley; Banerjee, S, Carlin, B P and Gelfand, A E (2004) Hierarchical Modelling and Analysis for Spatial Data, Boca Raton: Chapman & Hall.
You have to specify these neighbourhood identifiers and weights before you model, but this allows the use of multiple sets of join structures to evaluate a range of geographical hypotheses. A classic example is Haggett’s (1976) study of measles in Cornwall.129 If we look at Figure 22 we can colour a district black if it has a measles infection in a particular week, white otherwise. The maps (a) and (b) show two characteristic map patterns; (c) and (d) show the places joined in neighbourhood space so that places that are near each other are joined up. The pattern of map (a) is then seen to have strong patterning, lots of same-colour joins, whereas map (b) has little dependency: the presence of measles in a district does not make the presence of the disease in a neighbouring district any more or less likely. Maps (e) and (f) show the connectivity not in neighbourhood space but in hierarchical space where places are connected in terms of the size of their population; in this space, it is map (b) that has the strong spatial patterning.
Figure 22 Two patterns of measles in Cornwall, and two types of join structure
Haggett took the county of Cornwall and produced seven different join structures representing seven different plausible mechanisms of spatial spread (Figure 23). He then calculated the degree of dependency for each week for forty weeks as the epidemic passed through the county (Figure 24). The graphs on the left show a time-series plot of the
129 Haggett, P (1976) Hybridizing alternative models of an epidemic diffusion process, Economic Geography, 52(2): 136-146.
epidemic rising and falling, reaching a peak before week 20. The graphs on the right-hand side show a measure of spatial autocorrelation for each of the seven sets of weights. The dotted vertical line is the peak of the epidemic, while the dotted horizontal line represents a p value of 0.05. He found that the early weeks of the epidemic were characterised by hierarchical spread as the disease went from large place to large place (see G-6, G-5 and G-8). However, once past the epidemic peak, the dependency showed more local contagion (G-1 and G-7). The policy implications are clear: at the outset of an epidemic it is not sufficient to vaccinate locally; there is a need to make sure that the large population centres are covered if there is to be a possibility of containing the disease. Methodologically, it is vital to specify an appropriate joins and weights structure for the process being studied. Unfortunately, this geographical imagination is rarely brought to bear in applied research.
Figure 23 Seven alternative join structures corresponding to different spatial processes
Figure 24 Weekly degree of spatial autocorrelation according to the seven join structures
Three types of spatial models
We can recognise three basic types of spatial model130
Spatial lag dependence or autoregressive models in which the outcome depends on the response in neighbouring areas, so that the same values (albeit lagged) can appear on both sides of the equation. An example would be the prices of houses in this district depending on the prices of houses in neighbouring districts. The spatially weighted sum of neighbourhood housing prices (the spatial lag) enters the model as an explanatory variable.131 Schematically one can think of this as:

$$y = \rho W y + X\beta + \varepsilon$$

the response depending on the spatially lagged response through the weights $W$ and degree of dependency, $\rho$, and additional predictors $X$ with regression weights, $\beta$, plus some unstructured residual term, $\varepsilon$. Because of their simultaneity these are demanding models to fit, especially when, as would be common in the multilevel approach, there are levels below the area level such as people and occasions. At the time of writing they cannot be estimated in MLwiN and should in general only be used if dependence on previous responses is of substantive interest; to put it another way, lagged effects should have the possibility of a causal interpretation as spillover effects. An example might be that the number of infected people (response at t) in an area this week might depend on the number of infected last week (lagged predictor at t-1) in the area and on counts in neighbouring areas. For a discussion of estimation see Corrado and Fingleton (2011).132
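The simultaneity is clearest in the reduced form $y = (I - \rho W)^{-1}(X\beta + \varepsilon)$: every response depends on every other through the matrix inverse. A small simulation sketch in Python, with an invented four-area row-standardised weights matrix:

import numpy as np

# Simulate from a spatial lag model via its reduced form. W links four areas
# arranged in a line; rho, beta and the data are all illustrative values.
rng = np.random.default_rng(1)
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
W /= W.sum(axis=1, keepdims=True)  # row-standardise the weights

rho, beta = 0.5, 2.0
x = rng.normal(size=4)
e = rng.normal(scale=0.1, size=4)
y = np.linalg.solve(np.eye(4) - rho * W, beta * x + e)  # reduced form of y = rho*W*y + x*beta + e
print(y)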
Spatial residual dependence models in which the residual part of the model is partitioned into at least two parts: unstructured and structured area effects. Schematically this can be written as

$$y = X\beta + u + Wv$$

where the residual variation in the response, conditional on the fixed part of the model $X\beta$, is partitioned into an unstructured residual $u$ and a structured residual $v$; the structure of connectivity is defined by the weights $W$, and the degree of dependency by the size of the structured variance relative to the unstructured one. MLwiN can readily estimate such models that allow extra spatial dependence by specifying the spatial neighbours and associated weights as a multiple membership model. Such models have two important properties. First, the standard errors of the fixed part of the model are automatically corrected for the degree of spatial dependency. The underlying cause of the dependency could be spatially correlated omitted variables and spatially correlated errors in variable measurement.

130 Other models are possible, for example the spatial Durbin model in which the predictors are additionally involved in a lagged relationship: $y = \rho W y + X\beta + WX\gamma + \varepsilon$.
131 For an explanatory video on this model, see http://geodacenter.asu.edu/spatial-lag-and
132 Corrado, L and Fingleton, B (2011) Multilevel modelling with spatial effects, Strathclyde Discussion Papers in Economics, no 11-05. http://www.strath.ac.uk/media/departments/economics/researchdiscussionpapers/2011/11-05_FINAL.pdf
Second, because of the shrinkage properties of the estimated residual area effects, the differences between areas (in the structured case) are effectively smoothed to local means. A lot of spatial modelling is used to ascertain local hotspots for sudden infant death or childhood leukaemia, which typically have low incidence and hence a high stochastic component. Treating the structured residuals as a distribution which is bound together by a spatial neighbourhood weights matrix results in the estimate of relative risk for an area ‘borrowing strength’ from the surrounding areas. The success of this strategy of course depends on the appropriateness of the adjacency and weights matrix. Figure 25 shows a typical example of this in modelling ‘small-for-age’ children in Zambia; there is a clear patterning in the structured spatial effects, with a concentration in the north-east part of the country.133
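The logic of borrowing strength can be sketched as a precision-weighted average in which an area’s raw rate is pulled towards its patch mean, more strongly when the area supplies little information. The Python fragment below is a stylised illustration with invented numbers, not the actual estimator used by MLwiN.

def shrunken_rate(raw_rate, n_cases, local_mean, sigma2_between, sigma2_within_per_case):
    """Stylised precision-weighted estimate shrunk towards a local (patch) mean."""
    precision_area = n_cases / sigma2_within_per_case
    precision_prior = 1.0 / sigma2_between
    w = precision_area / (precision_area + precision_prior)
    return w * raw_rate + (1 - w) * local_mean

print(shrunken_rate(3.0, 2, 1.0, 0.5, 4.0))    # 1.4: with only 2 cases, pulled well towards the patch mean of 1
print(shrunken_rate(3.0, 200, 1.0, 0.5, 4.0))  # 2.9: with 200 cases, the area keeps most of its own signal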
Figure 25 Structured and unstructured spatial effects for ‘small’ children in Zambia134
Spatial heterogeneity models allow the relationship between a predictor and a response to vary across the map, and it is possible to put in higher-level variables to account for this variation. This is the idea behind Geographically Weighted Regression.135 This is an exploratory data analysis technique that allows the relationship between an outcome and a set of predictor variables to vary locally across the map. The approach aims to find spatial
133 An excellent pedagogic account using baseball averages and a map of blood toxoplasmosis is Efron, B and Morris, C (1977) Stein's paradox in statistics, Scientific American, 236(5): 119-127.
134 Source: http://www.uibk.ac.at/statistics/personal/lang/publications/remltutorial.pdf
135 There is a website at http://ncg.nuim.ie/ncg/GWR/; a book-length treatment is Fotheringham, A S, Brunsdon, C and Charlton, M E (2002) Geographically Weighted Regression: The Analysis of Spatially Varying Relationships, Chichester: Wiley.
non-stationarity and distinguish this from mere chance.136 The GWR technique works by identifying spatial sub-samples of the data and fitting a set of local regressions. Taking each area across a map in turn, a set of nearby areas that form the ‘local’ surrounding region is selected, and a regression is then fitted to the data in this region in such a way that nearby areas are given greater weight in the estimation of the regression coefficients. The surrounding region is known as the spatial kernel; this can be of a fixed spatial size across the whole map, but that will result in unstable estimation in regions where there are relatively few areas on which to base the local regression, and will possibly miss important small-scale patterns where a lot of local areas are clustered together spatially. Consequently, an adaptive spatial kernel is often preferred, so that a minimum number of areas that form the region can be specified and the kernel extends out until this number has been achieved. Changing the kernel changes the spatial weighting scheme, which in turn produces estimates that vary more or less rapidly over space. A number of techniques have been developed for selecting an appropriate kernel and indeed for testing for spatial stationarity.137 Once a model has been calibrated, a set of local parameter estimates for each predictor variable can be mapped to see how the relation varies.
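A bare-bones version of the GWR idea can be written in a few lines: for each focal area, observations are weighted by a distance kernel and a weighted least-squares fit is computed. The Python sketch below uses a fixed Gaussian kernel and invented coordinates and data; real GWR software additionally calibrates the kernel and bandwidth.

import numpy as np

# Invented data: 50 areas whose true slope drifts from west to east.
rng = np.random.default_rng(0)
coords = rng.uniform(0, 10, size=(50, 2))
x = rng.normal(size=50)
slope_true = coords[:, 0] / 10
y = slope_true * x + rng.normal(scale=0.1, size=50)

def local_coefs(i, bandwidth=2.0):
    """Weighted least-squares fit centred on area i with a fixed Gaussian kernel."""
    d = np.linalg.norm(coords - coords[i], axis=1)
    w = np.exp(-(d / bandwidth) ** 2)         # nearby areas get greater weight
    X = np.column_stack([np.ones(50), x])
    XtW = X.T * w
    return np.linalg.solve(XtW @ X, XtW @ y)  # local intercept and slope

print(local_coefs(0))  # the local estimates for area 0; mapping these shows the varying relation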
From the perspective of random coefficient modelling this procedure is inefficient in that a separate model is being fitted to each area. The multilevel modelling approach is, as usual, to fit a general fixed relation and allow this to vary from place to place as a random slope as part of an overall model. The difference in these spatial models is that the random slope is allowed to vary for each patch centred on each area, so that the observations for the predictor variable become in effect the values for that patch. Poorly estimated local slopes, due to a small number of areas in the patch or a lack of variation in the predictor for that patch, will be shrunk back to the general line, so that the procedure has some built-in protection against over-interpretation. It is possible to fit such models in MLwiN through multiple-membership models with additional random slopes associated with particular predictors.
A warning that applies across all three types of model: it is worth stressing that apparent spatial dependence can be produced by model misspecification, in that an omitted variable with a distinctive spatial pattern could show up as spatial autocorrelation amongst the residuals. Similarly, an incorrect functional form, such as fitting a linear relationship when the true relationship is concave, could also show up as apparent spatial dependence. This will occur when the predictor variable involved in the non-linearity has spatial patterning, so that the over- and under-estimation varies geographically. You are well
136 Brunsdon, C, Fotheringham, S and Charlton, M (1996) Geographically weighted regression: modelling spatial non-stationarity, Geographical Analysis, 28: 281-289.
137 Paez, A, Uchida, T and Miyamoto, K (2002) A general framework for estimation and inference of geographically weighted regression models: 1: location-specific kernel bandwidths and a test for locational heterogeneity; 2: spatial association and model specification tests, Environment and Planning A, 34: 733-754, 883-904.
advised to consider such mis-specification before embarking on these more complex spatial
models.
The spatial multiple membership model
MLwiN can fit both the spatial residual and the spatial heterogeneity model.138 The model is simply conceived as a classification structure where a lower-level observation is nested in a higher-level area and is also a ‘member’ of a varying number of local or neighbouring areas, as shown in Figure 26. Further insight is given by considering the town of Squareville in
Figure 27. Say we are dealing with an outbreak of swine-flu so that a person in district H is
conceived as being put at risk by the number of cases in the previous week in district H (a
strict hierarchical relationship) and by the disease counts in districts E, I, K and G (through
multiple membership relations). Similarly, we could conceive of someone in district A being affected by the disease counts in district A, a strict hierarchy, and by disease presence in B, C and D, a multiple membership relation. The linkage is therefore defined by including only
districts that are adjacent in the multiple membership relation. This can be created from a
map by the Adjacency For WinBUGS Tool if a map is stored in ArcMap format.139 We can also
place weights on the multiple membership relations to further emphasize the degree of
connectivity or even to define the membership. For example, we could use the inverse
distance between centroids of areas or, for more rapid fall off of the influence of other
places, the inverse of the square of distance. We probably would want to define a maximum
distance so that the entire map is not involved for each place. It is more natural to include the ‘home’ area in the set of multiple membership areas (it is contributing to its own ‘local mean’) and this has an important advantage with the software, as it is then straightforward to obtain the residuals for the structured spatial effects, with one for each area.
Figure 26: The spatial model as a multiple membership model
138 Software in this area includes GeoDa (http://geodacenter.asu.edu/software/downloads) and the Python-based PySAL (http://geodacenter.asu.edu/pysal) that has grown out of Anselin's work. Roger Bivand maintains the R Task View: Analysis of Spatial Data, which has a wide range of facilities (http://cran.r-project.org/web/views/Spatial.html). Stata has SPMLREG to estimate the spatial lag, the spatial error, the spatial Durbin, and general spatial models (http://ideas.repec.org/c/boc/bocode/s457135.html) and SPPACK for spatial-autoregressive models (http://ideas.repec.org/c/boc/bocode/s457245.html). In the R package spdep there are Lagrange multiplier tests for distinguishing between spatial lag and spatial residual dependence models. In addition to MLwiN, random coefficient multilevel approaches are available with MCMC estimation through BayesX and GeoBUGS, both of which have tools for mapping: http://www.stat.uni-muenchen.de/~bayesx/bayesx.html; http://www.mrc-bsu.cam.ac.uk/bugs/winbugs/geobugs.shtml
139 http://www.umesc.usgs.gov/management/dss/adjacency_tool.html
Figure 27 Multiple membership linkages in Squareville
[Map of Squareville: thirteen districts labelled A to M arranged in a grid]
Applying the spatial multiple membership model
We will now apply the spatial multiple membership model to three examples:140
• Low birth weights in South Carolina: the dependent variable is the proportion of low birth weight children in counties, and there are separate proportions for White and Black children. The predictor variable is the percentage of the population in poverty in the county. The idea behind this analysis is not just to estimate the general relation between low birth weight and poverty but also to identify residual hot spots in the manner of Sargent et al (1997), who found some low-deprivation areas with high
140 MLwiN is also able to fit another spatial model known as the conditional autoregressive (CAR) model (Besag et al, 1992). This, like the spatial multiple membership model, handles residual dependency; it is not a spatial lag model. The distinctive feature of this model is that there is one set of random effects, which have an expected value of the average of the surrounding random effects:

$$u_i \sim N\left(\bar{u}_i, \ \sigma_u^2 / n_i\right), \qquad \bar{u}_i = \sum_{j \in \text{neigh}(i)} w_{i,j}\, u_j / n_i$$

where $n_i$ is the number of neighbours for area i and the weights are typically all 1. MLwiN has only limited capabilities for the CAR model (the effects can be specified at only one level), although it is possible to include an additional set of unstructured random effects in what is known as a convolution model. The BUGS software allows more complex models. Browne (2009) shows how to set up the CAR model in MLwiN and how it can be exported to BUGS for modification. Besag, J, York, J and Mollie, A (1992) Bayesian image restoration with two applications in spatial statistics, Annals of the Institute of Statistical Mathematics, 43: 1-59; Browne, W J (2009) MCMC Estimation in MLwiN, v2.13, Centre for Multilevel Modelling, University of Bristol.
relative risk, probably due to gentrification resulting in the removal of lead-based paint.141 The model is a binomial logistic one with added spatial dependence.
• Respiratory cancer deaths in Ohio counties: the response is annual repeated measures for 1979 to 1988, the aim being to discover hotspot counties with distinctive trends. The model is a Poisson log model with an offset to take account of the number of people exposed. This is an example of space-time analysis as the model accommodates the repeated measures.
• Self-rated health of the elderly in China: the response is whether an individual is in good as opposed to fair/poor health. The model is a binomial logit model based on individual survey data with additional spatial effects between Chinese provinces. This short example aims to show that the models can be applied to more than just aggregate data.
For the South Carolina and China data, basic knowledge of the binomial model as fitted in
MLwiN with MCMC estimation is presumed; for the Ohio data, you need to know about the
Poisson model. Chapters 12 and 13 cover this material.
Low birth weights in South Carolina142
The data
Retrieve the saved MLwiN worksheet
SCarolBweight.wsz
In the Names window you will see the following data summary.
141 Sargent, J, Bailey, A J, Simon, P, Dalton, M and Blake, M (1997) Census tract analysis of lead exposure in Rhode Island children, Environmental Research, 74: 159-168.
142 We thank Beatriz B Caicedo Velasquez for help with this section.
There are 46 counties with two entries for each, in that the Black and White proportions are going to be modelled in a single model. The response variable is PropLBW, which has been calculated as the LBW count divided by TotBirths. The latter will form the denominator. The % in poverty is a predictor and the Race variable identifies whether the low weight proportion applies to White or Black children. The new concept is the set of 9 neighbourhood identifiers which follow immediately from the numerical county number in column 8. It is a requirement of the software that these identifiers form a strict consecutive sequence; here column 8 to column 17.
If we look at an extract of these identifiers, you will see that row one is the county of Abbeville, which is numbered 1. It has adjacency with 5 other neighbourhoods, which have the numbers 33, 30, 24, 23 and 4. The rest of the row is filled with zeroes; it is important to do this and not leave missing values. The second row is exactly the same, for this is the multiple membership relation for the Black proportion of low birth weights as opposed to the White proportion for Abbeville. Row 3 shows the adjacent neighbourhoods for county 2, Aiken; it also has 5 neighbours in addition to itself. The county with the most adjacent neighbourhoods, Orangeburg, has 9.
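If you were preparing such a worksheet from scratch, the padding rule is simple: each county's row holds its neighbour numbers followed by zeroes out to the fixed width. A Python sketch (Abbeville's neighbours are from the text; the second row is invented for illustration):

# Build fixed-width, zero-padded neighbour-identifier rows of the kind the
# worksheet expects; missing values must not be left in the columns.
adjacency = {1: [33, 30, 24, 23, 4],   # Abbeville and its 5 neighbours (from the text)
             2: [3, 6, 18, 25, 36]}    # Aiken: illustrative, invented numbers
MAX_NEIGHBOURS = 9

for county, neighbours in adjacency.items():
    row = neighbours + [0] * (MAX_NEIGHBOURS - len(neighbours))
    print(county, row)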
The spatially unstructured model
We will first fit a two-level binomial logistic model with PropLBW as the response and with
Cons defining level 1 and County defining level 2. This is a useful device so that level 1 is in
effect the children who are nested within areas. This allows the modelling of level-1
binomial variation (the denominator of the proportion, the number of trials, is declared to
be TotBirths) and level 2 to be the between-County variation on the logit scale (see Chapter
12, this volume).143
143 This is exactly the same model and results that would be found if you had binary data and the single predictor Black/White. The huge binary dataset of close to 55,000 births has been reduced to 92 rows without any loss of information. This formulation does not allow extra-binomial variation at level 1 even though we are modelling proportions. Another way of looking at this model is that there are two outcomes differentiated by Race in each County.
Given the form of the data it makes for a more readily interpretable model for the Race
variable to be added to the model as two separate dummies and not as a constant and a
contrast:
So that the fixed part gives directly the logit of the proportion of low birth weights for each
Race averaged across the counties. We can then allow between-County variance for each
Race:
The two variances will summarise the between-County variations and the covariance will
allow us to estimate the correlations between the area patterns for Black and White babies.
We will also include the %poverty variable in an interaction with Race.
After initial estimation with IGLS, the model was run in MCMC with a burn-in of 500 simulations and an initial monitoring run of 5k. The MCMC options were modified so that hierarchical centering was deployed at level 2 in the expectation of less-correlated chains. (Note this has nothing to do with centering % poverty, which was left un-centred here.) The IGLS estimates were:
On checking the initial run of the MCMC sampler, it was decided, on the basis of an Effective Sample Size as low as 100 from 5k simulations, to increase the monitoring run to 15k. The re-estimated model was:

MCMC 15k               Mean     S.E.    Median   CI(2.5%)  CI(97.5%)  ESS
Fixed Part
White                  -2.529   0.161   -2.533   -2.839    -2.203     1208
Black                  -2.018   0.091   -2.020   -2.190    -1.834     539
White.% in poverty     -0.001   0.011   -0.001   -0.023    0.020      886
Black.% in poverty     0.014    0.006   0.015    0.002     0.026      455
Random Part
Level: County
White/White            0.037    0.015   0.035    0.015     0.074      1153
Black/White            0.000    0.006   0.000    -0.013    0.013      463
Black/Black            0.006    0.004   0.004    0.001     0.017      301
Level: Cons
DIC: 791.759
The fixed estimates are most easily appreciated via the graphing of predictions; we have done this on the logit and the odds scales.144 Clearly Black babies have a higher risk than White babies and this differential increases with county poverty.
To interpret the between-counties random part we can plot the residuals and their 95% confidence intervals. There looks to be little residual variation between counties for Blacks and some variation for Whites. These areas might now be subject to further investigation.
144 To make predictions with confidence intervals of the fixed part, temporarily switch to IGLS estimation but do not re-estimate the model. To convert from logits to odds, add the minimum value of the logit estimate to make the base value zero, then exponentiate the result, so that this base is set to 1. If you are using the Customised predictions window the main effect for % poverty has to be specified but the fixed part ticked off to avoid exact multicollinearity when the interactions with Race are included in the model.
For what it is worth,145 we can look at the correlation between the differential logits of the areas, finding that, unsurprisingly, there is no correlation. We can also do the pairwise plot of the residuals; here we have turned them into odds with the South Carolina average being 1. We have set the scale to be the same for both axes so that the unchanging nature of the County Black relative odds is clearly seen. There is some element of differentials for the White babies, with the lowest rates being 25 percent below the county average and the worst rates getting on for 50% above.
The spatially structured model
Before we can fit the model with spatially structured random effects we have to create a weight to go with the identifier for each county multiple membership. In the absence of anything better we will use equal weights that sum to 1 for each set of neighbourhood joins. We start by naming 10 adjacent empty columns with the names wt0 (to hold the original county weight with itself), wt1 for the weight for the first neighbour, and so on up to wt9. We then take each neighbour in turn and recode all non-zero values to 1. Then create a new variable, TotWeight, which is the sum of the ten weight variables. Finally we need to divide each of the ten weights by this total to get an appropriate weight. Here is the procedure for Wt0.

145 It is a bit silly looking at the correlation with a variable, the latent differential for Black, that does not really vary!
and the same has to be done for all the nine other weights. Once this is completed, the weights columns should look like the following extract. Thus a place that has 9 neighbours plus itself has 10 equal weights of 0.1, whereas a place with 3 neighbours plus itself has 4 equal weights of 0.25.
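The same recode-and-divide arithmetic can be sketched outside MLwiN in a few lines of Python; the identifier row is the Abbeville extract used earlier.

# Equal multiple-membership weights: the home area plus each non-zero neighbour
# identifier gets weight 1/(number of members). Column wt0 is the home area itself.
neighbour_ids = [33, 30, 24, 23, 4, 0, 0, 0, 0]  # Abbeville's 9 identifier columns

indicator = [1] + [1 if j != 0 else 0 for j in neighbour_ids]  # wt0..wt9 before scaling
total = sum(indicator)                                         # TotWeight
weights = [v / total for v in indicator]
print(total, [round(w, 3) for w in weights])  # 6 members -> six weights of 0.167 and four zeroes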
To specify the model, begin in the Equations window by clicking on the response and increasing the number of levels to 3 (you have to be in IGLS/RIGLS to do this), choosing CountyNO as the level 3 identifier. Then click on the White and Black dummies in the fixed part and allow variation at level 3 to get the following model.
Staying with the IGLS/RIGLS estimation, Start the model to convergence. These estimates should be dismissed as they completely ignore the cross-classified structure. Now switch to MCMC in the Estimation Control window. This will allow you to specify the multiple membership cross-classification at level 3:

Model on main menu
MCMC
Classifications
Choose to Treat Levels as cross-classified  [leave level 2 to be the unstructured effects]
Choose Multiple classifications at level 3
Specify the Number of columns to be 10  [the largest number of neighbours]
Specify that Weights start at Column Wt0  [the identifiers will be understood as the 10-1 columns following CountyNo; both the identifiers and weights have to be in consecutive columns]
Done
After this, the model will still look like a hierarchical one on the Equations screen and will not convey the information needed to recognize that the model is now cross-classified with multiple membership weights. To overcome this MLwiN has a different notation to that which we have been using so far.146 This is based on classifications rather than subscripts. The response variable subscript i is still used to index the lowest level units, but classification names are used for the subscripts of random effects. As there are several classifications, they are given a superscript to represent the classification number, starting at 2. This notation also clearly indicates weights and multiple memberships. In the bottom toolbar of the Equations window, choose Notation and ensure multiple subscripts is turned off.
146 Browne, W J, Goldstein, H and Rasbash, J (2001a) Multiple membership multiple classification (MMMC) models, Statistical Modelling, 1: 103-124; Rasbash, J and Browne, W J (2002) Non-hierarchical multilevel models, in De Leeuw, J and Meijer, E (eds) Handbook of Multilevel Analysis, Springer.
It can be clearly seen that the 3rd classification (CountyNo) is based on two sets of weighted residuals (these are the spatially structured residuals) while the 2nd classification (County) does not have any weights. The Hierarchy window will now reflect this ‘Classification’ view of the model.
Start to estimate, keeping a burn-in of 500 and a monitoring run of 15k, and choosing the hierarchical centering to be at level 3 to hopefully speed up estimation. You will probably get a software error at this point due to poor starting values from the IGLS estimation, which of course does not know the spatial membership relations. The IGLS estimation found all the level 2 random variances to be 0.0. This is an important issue in successful estimation that we must consider in more detail.
A short digression on MCMC starting values
A common problem in MCMC estimation is the ‘non-positive definite matrix’ warning, or that the prior matrix is ‘non-positive definite’. Both problems will be flagged by the software and the estimation will not proceed. The underlying cause is that either the correlation between the random effects has been estimated by IGLS to be outside the range of +1 to -1 or, as here, there is an estimated variance of zero. This can usually be overcome by changing the covariance to an initial value of 0, implying a non-offending correlation of zero, and/or by changing the variance to a small positive value. This can be achieved in practice by editing c1096, which contains the IGLS estimates for the random part of the model.147 This should be done in IGLS before switching to MCMC, as the IGLS values are used to define a prior distribution; we suggest that small values are used so as not to impose too much prior information. This is particularly a concern when there are few higher-level units, as you are effectively adding extra data (which you have made up!) to the problem. You may need to experiment with values for the variance that are big enough to allow estimation but not large enough to affect the results. We proceed by editing c1096, replacing the zeros for the variances with the value 0.001 and leaving the covariance at 0.0. A relatively low value of 0.001 was used so as not to have a major impact on the final estimates and DIC values.
Here are the problematic initial IGLS estimates, with three variances of zero. We edit the values in c1096 as follows (the value of 1 in row 7 is the binomial constraint):
Before
After editing
147 A useful trick if there is a complex random part with problematic covariances is to click on the random effects variance-covariance matrix for a particular classification and choose a diagonal matrix, then click again to request a full matrix. This action results in all covariances being initially set to zero for that classification.
Switching to MCMC estimation and requesting the full Bayesian specification of the model by clicking on + in the lower toolbar, the MCMC starting model is shown below. You can see that the initial estimates have been used to create a not-uninformative prior matrix in the form of an inverse Wishart distribution for the unstructured and spatially structured random effects.
Back to the spatially structured model
After 15k simulations the Effective Sample Size of several parameters was rather low, suggesting slow convergence, so the monitoring run was increased to 100k with a thinning of 10: only 1 in 10 values were stored but all were used in the calculations. The resultant model estimates are:
Comparing the estimates of the two models (Aspatial and Spatial, see below), the spatial model is a substantial improvement as the DIC has gone down from 792 to 779. The most important new term is the spatially structured unexplained variance between counties for Whites, which is four times larger than the equivalent value for Blacks. There is not much unstructured unexplained variance for either Blacks or Whites between counties. In the fixed part of the model little has changed, except that the standard errors are higher now that the residual spatial dependence has been explicitly modelled.
                    Aspatial                                     Spatial
                    Est     S.E.   CI(2.5)  CI(97.5)  ESS        Est     S.E.   CI(2.5)  CI(97.5)  ESS
Fixed Part
White               -2.523  0.161  -2.839   -2.203    1208       -2.634  0.175  -2.997   -2.292    243
Black               -2.030  0.091  -2.190   -1.834    539        -2.016  0.120  -2.256   -1.782    378
White.% in poverty  -0.001  0.011  -0.023   0.020     886        0.006   0.011  -0.016   0.029     251
Black.% in poverty  0.014   0.006  0.002    0.026     455        0.014   0.007  -0.001   0.028     389
Random Part
Level: CountyNo (spatially structured)
White/White                                                      0.242   0.123  0.087    0.552     2248
Black/White                                                      0.057   0.050  -0.023   0.170     1843
Black/Black                                                      0.056   0.038  0.014    0.152     1582
Level: County (unstructured)
White/White         0.037   0.015  0.015    0.074     1153       0.014   0.008  0.004    0.034     2261
Black/White         0.000   0.006  -0.013   0.013     463        -0.004  0.005  -0.015   0.004     1976
Black/Black         0.006   0.004  0.001    0.017     301        0.008   0.005  0.002    0.020     2432
DIC:                791.759                                      779.082
The revised fixed part estimates are shown graphically below. There is now a positive relationship with poverty for both Blacks and Whites, although the line for Whites has a particularly large uncertainty.
We can store the estimated level 3 spatial residuals in c400 and c401 and the level 2 aspatial residuals in c300 and c301, and then exponentiate these values to get odds compared to the South Carolina average of 1.
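In the Command interface the exponentiation can be done with four calc statements (using the storage columns just described):

calc c400 = expo(c400)
calc c401 = expo(c401)
calc c300 = expo(c300)
calc c301 = expo(c301)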
The plot has been made with the same scale on both sets of axes to allow direct comparison, and the baseline of 1 has been added to each graph. The relative importance of the spatial differentials is immediately apparent, particularly for Whites; there is a threefold differential when the best areas are compared to the worst. There appears to be some correlation between the two sets of spatial differentials, and the correlation (taken from the Estimate Tables) is 0.49. This covariance between the two sets of spatial effects will mean that borrowing strength will occur between the map patterns for Blacks and Whites.148 The differentials between the aspatial odds are very much smaller, with a negative correlation of 0.4. There would appear to be 'patches' of the map with elevated unexplained differentials that are more marked for Whites than Blacks, but 'patches' that are high for one group tend also to be high for the other. Within a 'patch' the differences are small, but where they are slightly elevated for one group they are slightly reduced for the other.
If one examines the standard errors of these residuals they are large in comparison to the estimates, and one reaction is that we can say little about even the spatial 'patch' differences. But another view is that this involves an element of 'double counting'. Clayton and Kaldor (1987), in their classic paper on lip cancer in Scotland, simply map the residuals, as their uncertainty has already been accommodated through precision-weighted estimation: where there is a lack of evidence, the estimates have already been shrunk towards the overall mean.149 Certainly, in exploratory work with an emphasis on the precautionary principle of not missing true high rates, there is a great deal of sense in this. Consequently it would be sensible to produce a table of values that could be printed or exported to a mapping package. Currently the category labels in column 1 are on a long (replicated) column, so we first need to un-replicate them to produce a short column.
148 Jones, K and Bullen, N (1994) Contextual models of urban house prices: a comparison of fixed- and random-coefficient models developed by expansion, Economic Geography, 70(3), 252-272.
149 Clayton, D and Kaldor, J (1987) Empirical Bayes estimates of age-standardized relative risks for use in disease mapping, Biometrics, 43, 671-81.
In the Names window the Categories of column 1 can be copied and then pasted onto the new short column.150 A print-out of the county labels and estimates can then be made; here we have added the ranks.
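A sketch of the print-out step in the Command interface (the choice of c299 for the short labels is ours, purely for illustration; c300-c301 and c400-c401 are the stored odds from above):

name c299 'CountyLab'
print c299 c300 c301 c400 c401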
               Aspatial                  Spatial
               White        Black        White        Black
County         Odds  Rank   Odds  Rank   Odds  Rank   Odds  Rank
Abbeville      1.01  13     0.99  33     1.02  21     0.96  31
Aiken          1.03  5      0.97  42     1.23  8      0.93  36
Allendale      1.00  24     0.99  34     0.77  42     0.87  42
Anderson       1.01  17     0.99  30     0.99  24     0.96  30
Bamberg        0.99  32     1.00  25     0.73  44     0.89  39
Barnwell       1.03  8      0.98  37     1.24  7      0.96  29
Beaufort       0.99  28     0.99  29     0.70  45     0.86  43
Berkeley       0.99  34     1.01  15     0.84  36     0.98  26
Calhoun        1.01  12     0.98  39     0.84  37     0.84  45
Charleston     0.98  39     1.02  9      0.92  31     1.07  11
Cherokee       1.04  3      0.98  38     1.72  2      1.06  12
Chester        1.02  10     0.98  36     1.08  15     0.95  33
Chesterfield   0.99  35     1.03  3      1.37  6      1.29  1
Clarendon      0.99  30     1.00  20     0.80  39     0.94  35
Colleton       1.01  11     0.97  41     0.76  43     0.80  46
Darlington     1.01  15     1.00  24     1.20  10     1.05  14
Dillon         0.99  37     1.02  12     0.99  25     1.08  10
Dorchester     0.98  41     1.03  4      1.02  20     1.16  6
Edgefield      1.01  16     0.99  31     1.01  23     0.96  32
Fairfield      1.07  1      0.95  46     1.86  1      0.94  34
Florence       1.00  25     1.01  18     1.09  14     1.06  13
Georgetown     0.97  42     1.03  2      1.07  18     1.20  4
Greenville     1.00  21     1.00  22     1.07  17     1.02  18
Greenwood      0.98  38     1.02  11     0.98  26     1.08  9
Hampton        1.00  22     0.99  35     0.80  40     0.87  40
Horry          0.99  29     1.01  16     0.93  29     1.01  22
Jasper         0.98  40     1.01  14     0.69  46     0.92  37
Kershaw        1.00  23     1.00  21     1.03  19     1.02  20
Lancaster      1.03  7      0.99  32     1.64  3      1.11  7
Laurens        1.04  4      0.96  45     1.17  11     0.87  41
Lee            0.99  27     1.01  17     0.95  28     1.01  21
Lexington      1.02  9      0.97  40     0.91  33     0.85  44
McCormick      0.99  33     1.02  6      1.21  9      1.18  5
Marion         0.97  44     1.02  7      0.83  38     1.05  15
Marlboro       1.05  2      0.97  43     1.60  4      1.00  24
Newberry       0.99  31     1.01  19     0.88  35     0.99  25
Oconee         1.03  6      0.97  44     1.11  13     0.89  38
Orangeburg     0.96  46     1.04  1      0.92  30     1.20  3
Pickens        1.00  19     0.99  28     0.96  27     0.96  28
Richland       1.00  20     1.00  23     1.07  16     1.02  19
Saluda         0.99  36     1.01  13     0.91  32     1.03  16
Spartanburg    1.01  14     1.00  26     1.15  12     1.02  17
Sumter         0.97  45     1.02  8      0.77  41     1.01  23
Union          1.00  26     1.02  10     1.40  5      1.22  2
Williamsburg   1.00  18     1.00  27     1.01  22     0.98  27
York           0.97  43     1.03  5      0.90  34     1.09  8
150 Copy 3 colnumber in the Command window copies the labels to the clipboard (it is option 3 that requests the category labels).
Fairfield County in the spatial model has the highest relative odds for Whites, 86 per cent above the South Carolina average, which must mean it is in a patch of high areas, its patch being Union (1.40), Richland (1.07), Newberry, the only one below 1 (0.88), Lancaster (1.64), Kershaw (1.03) and Chester (1.08). Fairfield is also ranked first for the aspatial values, with an additional 7 per cent above this 'patch'-induced value. Remembering that poverty has been conditioned on, it would be interesting to see what is distinctive about this county.
If we plot the odds for Whites we see that the spatially structured map shows a concentration of higher risk in the north of the State once poverty is taken into account. Both the Black and the White maps in the spatially structured case show lower risk in the south-west of the map.
Spatial heterogeneity: a random effects GWR
We have in fact already fitted a model with spatial heterogeneity, in that the effects for Black and White, two individual attributes, have been allowed to vary over the map between counties both aspatially and spatially. But we can also allow the higher-level effect of % in poverty to vary spatially between counties. In estimation and interpretative terms this is a tall order, as we will need an additional seven random terms to model what we could call a random-effects 'geographically weighted regression'. In IGLS estimation click in the Equations window on the fixed terms involving % in poverty and allow them to vary at level 3. The resultant model should look like the following.
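Algebraically (our sketch of what the Equations window should now show), the level-3 random part expands from two random intercepts to four random coefficients for each county patch:

(u(3)White, u(3)Black, u(3)White.%poverty, u(3)Black.%poverty) ~ N(0, Omega(3))

where Omega(3) is now a full 4 x 4 covariance matrix: four variances and six covariances, which is seven more random parameters than the three of the intercepts-only spatial model.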
Set a diagonal matrix at level 3 by clicking on Ω(3), edit the column c1096 to give a starting value of 0.001 for any variances that are zero, and then click on Ω(3) again and choose the full set of covariances. Do this in IGLS before switching to MCMC. The model before MCMC estimation is:
After 200k MCMC simulations we get the following results and stored estimates; even with this long run, the ESS is not really big enough.
REGWR200                                Est     S.E.    Median  CI 2.5%  CI 97.5%  ESS
Fixed Part
White                                   -2.803  0.327   -2.811  -3.427   -2.126    91
Black                                   -1.978  0.342   -1.982  -2.638   -1.297    64
White.% in poverty                      0.017   0.031   0.018   -0.043   0.074     90
Black.% in poverty                      0.011   0.030   0.011   -0.052   0.071     54
Random Part
Level: CountyNo (spatially structured)
White/White                             0.229   0.257   0.154   0.043    0.868     518
Black/White                             -0.004  0.179   -0.002  -0.338   0.291     526
Black/Black                             0.236   0.282   0.154   0.044    0.917     462
White.% in poverty/White                -0.015  0.023   -0.010  -0.071   0.011     736
White.% in poverty/Black                0.000   0.020   0.000   -0.036   0.038     976
White.% in poverty/White.% in poverty   0.020   0.006   0.019   0.011    0.033     3058
Black.% in poverty/White                0.000   0.017   0.000   -0.033   0.035     1321
Black.% in poverty/Black                -0.014  0.023   -0.009  -0.070   0.011     697
Black.% in poverty/White.% in poverty   -0.001  0.004   -0.001  -0.008   0.007     3603
Black.% in poverty/Black.% in poverty   0.018   0.005   0.018   0.011    0.031     3084
Level: County (unstructured)
White/White                             0.015   0.009   0.013   0.005    0.040     3922
Black/White                             -0.002  0.008   -0.002  -0.021   0.013     4927
Black/Black                             0.034   0.016   0.031   0.014    0.074     4892
DIC: 807.559
The DIC shows that this random-effects GWR model is not an improvement over the model with just the White and Black spatially structured random effects, with its DIC of 779.082. However, both slope variance terms are in fact quite well estimated, with effective sample sizes of over 3000. Moreover, the smoothed histograms of their values suggest that there is not much support for the parameters being zero, suggesting that the relationship between % in poverty and low birth weight does vary across the map. Emboldened by this we estimated the residual differential intercepts and slopes and plotted them on a varying relations graph and as maps.
               Differential Intercepts       Differential Slopes
County         White Births  Black Births    White Births  Black Births
Abbeville      0.0105        -0.0353         0.0216        -0.0141
Aiken          0.0198        0.0062          0.0236        0.0034
Allendale      0.0210        0.0329          0.0502        -0.0499
Anderson       -0.0222       -0.0087         0.0534        -0.0975
Bamberg        0.0196        0.0021          -0.0391       0.0279
Barnwell       0.0591        -0.0052         0.0229        0.0340
Beaufort       -0.0162       -0.0254         0.0044        -0.0351
Berkeley       -0.0146       0.0248          0.0222        0.0655
Calhoun        -0.0002       -0.0269         -0.0665       0.0193
Charleston     0.0015        0.0415          0.0271        -0.0048
Cherokee       -0.0454       0.0331          0.0015        0.0581
Chester        -0.0532       0.0341          0.0231        -0.0065
Chesterfield   0.0066        -0.0007         0.0462        -0.0131
Clarendon      -0.0192       -0.0054         -0.0239       -0.0170
Colleton       0.0231        -0.0108         -0.0153       -0.0125
Darlington     -0.0089       0.0073          0.0251        -0.0026
Dillon         0.0211        -0.0036         0.0412        0.0159
Dorchester     0.0019        0.0392          -0.0519       0.0324
Edgefield      0.0195        -0.0051         0.0425        0.0320
Fairfield      -0.0047       0.0250          0.0607        0.0110
Florence       0.0271        0.0116          -0.0692       -0.0217
Georgetown     -0.0032       -0.0152         -0.0381       -0.0397
Greenville     -0.0270       0.0308          -0.0228       -0.0190
Greenwood      0.0096        -0.0127         -0.0748       0.0219
Hampton        0.0191        -0.0116         -0.1240       -0.0216
Horry          0.0346        -0.0132         -0.0981       0.0444
Jasper         -0.0195       -0.0155         -0.0175       0.0292
Kershaw        0.0244        0.0081          0.0742        0.0164
Lancaster      -0.0405       -0.0053         -0.0318       0.0170
Laurens        -0.0153       0.0210          -0.0288       0.0471
Lee            -0.0082       0.0074          -0.0756       0.0690
Lexington      0.0102        -0.0446         -0.0319       0.0158
McCormick      0.0075        -0.0240         0.0197        0.0023
Marion         0.0111        0.0020          0.0556        0.0056
Marlboro       0.0018        0.0100          0.0406        0.0024
Newberry       -0.0274       0.0072          0.0257        0.0048
Oconee         0.0191        -0.0282         0.0329        -0.0272
Orangeburg     0.0375        -0.0202         0.0057        -0.0449
Pickens        -0.0424       -0.0018         -0.0420       0.0170
Richland       -0.0423       -0.0231         0.0194        -0.0456
Saluda         0.0137        -0.0294         -0.0297       -0.0480
Spartanburg    -0.0078       0.0149          0.1121        -0.0125
Sumter         -0.0112       0.0139          -0.0290       -0.0224
Union          -0.0461       0.0223          0.0232        -0.0705
Williamsburg   -0.0368       0.0157          0.0000        0.0215
York           -0.0256       0.0427          0.0029        0.0218
These estimated differential intercepts and slopes have to be seen in the context of the overall intercepts and slopes that are found generally across South Carolina, namely -2.803 for the White intercept where % in poverty is zero and -1.978 for the Black intercept. The respective general slopes between low birth weight and % in poverty are 0.0117 and 0.011. The varying relations plot given below cannot simply be derived from the Predictions or Customised Predictions window, as there is only a single value of % in poverty for a county; instead we need the range of values that are found in the spatial patch that surrounds each county. Thus Spartanburg has a low and narrow range of % in poverty, but the relation in this patch is a steep positive one.
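As a worked example, taking the estimates above at face value, the White relation for the Spartanburg patch is approximately

(-2.803 - 0.0078) + (0.0117 + 0.1121) x (% in poverty) = -2.811 + 0.124 x (% in poverty)

on the log-odds scale; the differential slope is an order of magnitude larger than the general slope, which is why the plotted line for this patch is so steep.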
The maps show the variation of both the White and the Black slopes. What are we to make of these quite marked variations? Even though GWR is supposed to be an exploratory technique, we suggest that the results are treated with considerable circumspection. The DIC suggests that the more complex model is not a more parsimonious fit than the simpler model. All the fixed terms, including the general slope terms for % in poverty, have relatively small effective sample sizes. The differential slopes for each county patch are somewhat implausible, in that they are large enough to overwhelm the general positive relation between poverty and low birth weight for both Whites and Blacks, so that in some counties higher area-based poverty is associated with a lowering of the risk of low-birth-weight children. Moreover, there is a real danger of capitalising on, and indeed maximising, model mis-specification and creating a cross-level fallacy: we do not have data on whether the individual child is or is not in poverty, only that they live in poor areas.151
Respiratory cancer deaths in Ohio counties: space-time modelling
The data
Retrieve the saved MLwiN worksheet OhioCancer79-88.wsz.
151 Subramanian, S V, Jones, K, Kaddour, A and Krieger, N (2009) Revisiting Robinson: the perils of individualistic and ecologic fallacy, International Journal of Epidemiology, 38(2), 342-360.
In the Names window you will see the following data summary. There are 880 observations, representing 10 years of observation for 88 counties. The contents of the columns are as follows:
C1      a numerical code for each and every county, 1 to 88, which has been replicated 10 times;
C2-C9   the neighbour identifiers; there is a maximum of 8 adjoining counties in addition to the county itself;
C10     the constant, as usual just a set of 1's;
C11     the number of observed respiratory cancer deaths in each county in each of the 10 years;
C12     the expected number of deaths if State-wide age-sex rates applied;
C13     time as a numerical variable, 0 to 9, representing 1979 to 1988;
C14     another time variable which takes the numerical values 0 to 9, but is a categorical variable;
C15     the Standardised Mortality Ratio: a measure of the risk of dying from respiratory cancer in a particular county in a particular year. It has simply been calculated as the observed number of deaths divided by the expected, so that the State risk is set to 1; a value of 2 is double the all-State risk, while a value of 0.5 is half the State risk.
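The SMR in C15 could equally have been created from the observed and expected columns in the Command interface (a one-line sketch; we assume c16 is a free column):

calc c16 = 'Obs'/'Exp'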
At the outset we need to create another column, named 'County2', which is a duplicate of column 1, so that we have different identifiers for the structured and unstructured classifications.152
calc 'County2' = c1
152 If this is not done, the Stored model functionality will become confused.
The aim is to model the SMR, rather than treat it as a simple descriptive value, so as to identify 'hotspot counties' with distinctive trends. The problem with the SMR is that, as a ratio, it is highly unstable: when the expected value is close to zero (signifying a rare event) any positive count will lead to a ratio above 1.153 The SMR also does not take into account that its variance is proportional to 1/Expected (if the observed count follows a Poisson distribution, Var(Obs/Exp) = μ/Exp², which is approximately SMR/Exp), so that with rare events and small populations we have a great deal of natural heterogeneity, which makes spotting hotspots difficult. From a modelling perspective, it is like fitting a saturated model with, in this case, 880 separately estimated parameters. Respiratory cancer deaths are quite rare, and therefore we are going to model the observed outcome, taking account of the expected counts, as a log Poisson model. Such a model will explicitly take account of the stochastic nature of the counts and their heterogeneity, and we will additionally include a spatial multiple membership relation so that poorly estimated counties (with small numbers) will borrow strength from neighbouring, more precisely-estimated counties with larger numbers of deaths. In comparison to the descriptive SMRs we will be looking for the degree of evidence that a county has high rates or an upward trajectory, based on explicit modelling of trends and area differences.
Unstructured Poisson area-effects models
We will start the modelling with an aspatial two-level Poisson model, specified as follows:
• Model on main menu
• Equations
• y: the response is Obs
• two-level model, ij, with County2 declared to be the level 2 identifier and Time as level 1; done
• change the red x0 to be Cons and, to get an overall grand mean value, tick the j(County2) and i(Time) random effects; done
• click on N for Normal and change the response type to be Poisson
• click on Estimates to get the following model
153 Jones, K and Kirby, A (1980) The use of Chi-square maps in the analysis of census data, Geoforum, 11(4), 409-17.
where μij is the underlying mean count; we are going to model the loge of this mean (which means that on exponentiation we cannot have an estimated risk below 0). At level 1 the variance of the observed deaths is equal to this mean, that is, the counts are assumed to follow a Poisson distribution, and there is a level 2 variance, σ²u0, which summarises the between-county differences. (We also fitted a model with extra-Poisson variation, but there was not substantial evidence for this once the time trend and area effects were taken into account.)
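In symbols (our sketch of the equations-window display), the null model is

Obsij ~ Poisson(μij)
loge(μij) = β0x0 + uj,   uj ~ N(0, σ²u0)

and once the offset is declared (the next step) the second line becomes

loge(μij) = loge(Expij) + β0x0 + uj

so that μij/Expij = expo(β0 + uj); that is, expo(β0 + uj) is the modelled SMR for county j.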
Clicking on the μ in the second line of the window allows us to specify the offset (this allows the modelling of the rate Obs/Exp by taking the loge of the expected value onto the right-hand side of the equation and constraining its coefficient to 1). As the window warns, we first have to take the loge of the expected value:
• Data Manipulation on main menu
• Command interface
• enter the following commands into the lower box, one at a time, pressing return after each
name c17 'LogeExp'
calc 'LogeExp' = loge('Exp')
Returning to the μ in the second line of the Equations window, we can now specify the loge(offset), followed by done, to get the revised equation.
We can estimate this null model and then switch to MCMC estimation with a burn-in of 500, a monitoring run of 50000, a thinning of 10, and hierarchical centering at level 2 requested in the MCMC options. Poisson models are known to have slowly mixing MCMC chains, which is why we have started with a monitoring run of 50k. These values can then be stored as 2LevNull. The model estimates are as follows.
Not surprisingly, the loge grand mean estimate is close to zero, indicating that the age standardisation has worked, as this corresponds to an average SMR of 1 when we exponentiate the value.
We can now ascertain whether there is a linear trend in these values or whether something more complicated is needed.
• Model on main menu
• Equations
• set the estimation back to IGLS so as to modify the model
• Add term; choose Year, tick orthogonal polynomial, and choose degree 1
This specifies a linear trend; estimate with IGLS to convergence and then by MCMC, storing the model as Linear. We can now see if there is evidence that this linear trend varies from place to place. Click on the orthog_Year^1 variable and allow 'random-slopes' variation for the linear trend parameter over county, as in the specification below.
Estimate to convergence with IGLS and then by MCMC, storing the model as LinSlope. We will now see if there is any evidence for more complex trends by modifying the orthog_Year^1 variable, requesting a 2nd-order polynomial, and allowing this quadratic parameter to vary over County2, as in the specification below. Again estimate to convergence with IGLS and then using MCMC, storing the results as Qslopes.154
We can now compare the results of the four models, and we can see that the most parsimonious with the highest predictive capacity (lowest DIC) is the model in which the linear slope parameter is allowed to vary over counties. That is, while there is no evidence of a strong general trend (the linear model is not an improvement over the null random-intercepts model), there is evidence of between-county differences in the linear trend. That must mean that in some counties the trend is upwards and in others downwards, given that the general term is flat. In fact, of course, we should not have anticipated a general trend, as the SMR has been calculated on the basis of the expected value for that year. The linear slope model will now be the base for the spatial modelling. Interestingly, a model without any random effects has a DIC of 7186, providing strong evidence that there are differences between Ohio counties in their risk of respiratory cancer.
154 If you get a warning message on starting MCMC estimation, switch back to IGLS, click on Ω and set it to a diagonal matrix, then click on it again to get the full covariance matrix; this will set the covariances to zero. Then switch to MCMC and estimate the model.
                             2levNull  S.E.    Linear    S.E.    LinSlope  S.E.    Qslopes  S.E.
Fixed Part
Cons                         -0.090    0.022   -0.090    0.022   -0.090    0.022   -0.091   0.022
orthog_Year^1                                  0.003     0.013   0.028     0.020   0.029    0.020
orthog_Year^2                                                                      -0.009   0.017
Random Part
Level: County2
cons/cons                    0.038     0.007   0.037     0.027   0.038     0.007   0.038    0.007
orthog_Year^1/cons                                               0.002     0.004   0.002    0.003
orthog_Year^1/orthog_Year^1                                      0.007     0.003   0.007    0.004
orthog_Year^2/cons                                                                 -0.007   0.003
orthog_Year^2/orthog_Year^1                                                        0.001    0.002
orthog_Year^2/orthog_Year^2                                                        0.003    0.002
Level: Time
bcons.1/bcons.1              1.000     0.000   1.000     0.000   1.000     0.000   1.000    0.000
DIC:                         5758.348          5759.7            5743.2            5743.9
Spatially structured Poisson area-effects models
Although we already have the neighbourhood identifiers that define each patch, we also have to create weights to go with these identifiers. Again, in the absence of anything better, we will use equal weights that sum to 1 for each set of neighbourhood joins, as we did with the Carolina data. We start by naming adjacent empty columns to hold the 9 sets of weights, with the names wt0 (to hold the original county's weight with itself), wt1 for the weight for the first neighbour, and so on up to wt8. We then take each neighbour identifier in turn and recode all non-zero values (that is, values 1 to 88) to a new value of 1, summing the values across the rows to create a new variable, TotWeight. Finally we need to divide each of the nine weights in turn by this total to get an appropriate weight. Once this is completed the weights columns should look like the following extract.
Thus County 1 forms a patch of five units (itself plus its four neighbours) and therefore has 5 equal weights of 0.2. You can see that the weights are replicated 10 times to reflect the repeated-measures structure of years within counties.
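The final division can be done with calc in the Command interface; a sketch for the first two of the nine weight columns (the same pattern is repeated up to wt8):

calc 'wt0' = 'wt0'/'TotWeight'
calc 'wt1' = 'wt1'/'TotWeight'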
To specify the model, begin in the Equations window by clicking on the response and increasing the number of levels to 3 (you have to be in IGLS/RIGLS to do this), choosing County as the level 3 identifier. Then click on the Constant in the fixed part and allow variation at level 3 to get the following model.
Staying with IGLS/RIGLS estimation, Start the model to convergence. These estimates should be dismissed, as they completely ignore the cross-classified structure. Now switch to MCMC in the Estimation control window. This will allow you to specify the multiple membership cross-classification at level 3:
• Model on main menu
• MCMC
• Classifications
• choose to Treat levels as cross-classified      [leave level 2 to be the unstructured effects]
• choose Multiple classifications at level 3
• specify the Number of columns to be 9           [the largest number of neighbours plus the county itself]
• specify that Weights start at column wt0        [the identifiers will be understood as the 8 columns following County; both the identifiers and the weights have to be in consecutive columns]
• Done
After this, the model will still look like a hierarchical one on the Equations screen and will not convey that the model is now cross-classified with multiple membership weights. To overcome this, use the classification notation: in the bottom toolbar of the Equations window choose Notation and ensure that multiple subscripts is turned off.
It can be clearly seen that the 3rd classification is based on a set of weighted residuals (these are the spatially structured residuals), while the 2nd classification does not have any weights (these are the unstructured residuals).
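Written out (our sketch, using the classification notation), the linear predictor for observation i is

loge(μi) = loge(Expi) + β0 + Σj wij u(3)j + u(2)county2(i)

where the sum is over the counties j in observation i's patch, the wij are the equal weights just constructed (summing to 1 for each patch), the u(3) are the spatially structured county residuals and u(2) is the unstructured county residual.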
Start to estimate, keeping a burn-in of 500 and a monitoring run of 50k, and choosing hierarchical centering at level 3 to, hopefully, speed up estimation. You may get an error at this point due to poor starting values from the IGLS estimation, which of course does not know the spatial membership relations. This can be remedied by editing the column (c1096) where the estimates are stored, replacing the zeroes for the variances with 0.01 and leaving the covariances alone. A low value, 0.01, was used so as not to have a major impact on the final result. You may have to experiment with this (we found that 0.001 was too small and we still got a warning). The initial estimates, before running the MCMC chains but after modifying c1096 and requesting more details on the model specification (by clicking on + on the toolbar at the bottom of the window), are as follows.
After 50k simulations with hierarchical centering at level 3, the results were stored in a model labelled SpatMod1. A more complicated model was then fitted in which the parameter for the linear trend was allowed to vary over the spatially structured neighbourhoods. This was achieved by clicking on the linear trend parameter in the fixed part and allowing it to vary at level 3. The same process was adopted: estimating in IGLS, changing variances that are zero to 0.01, and then switching to MCMC with a burn-in of 500 followed by 50k simulations, to get the following results, which are stored as SpatMod2.
Comparing the three models, they have very similar DIC values; there is not a great deal of evidence that a spatial model is needed.
                             LinSlope  S.E.    SpatMod1  S.E.    SpatMod2  S.E.
Fixed Part
Cons                         -0.091    0.022   -0.085    0.035   -0.083    0.034
orthog_Year^1                0.029     0.020   0.030     0.020   0.032     0.025
Random Part
Level: County2 (unstructured)
cons/cons                    0.038     0.007   0.037     -       0.027     0.054
orthog_Year^1/cons           0.001     0.004   0.001     -       -0.006    0.010
orthog_Year^1/orthog_Year^1  0.007     0.003   0.006     -       0.002     0.016
Level: Time
bcons.1/bcons.1              1.000     0.000   1.000     0.000   1.000     0.000
Level: County (spatially structured)
cons/cons                                      0.078     0.055   0.067     0.007
orthog_Year^1/cons                                               0.016     0.006
orthog_Year^1/orthog_Year^1                                      0.020     0.014
DIC:                         5743.306          5743.434          5743.129
However, in pedagogic mode, we will continue by looking at the variance function at level 2 for the unstructured area effects and at level 3 for the spatially structured effects, storing them in c31 and c32 respectively. We can then plot them against time to see if the rates are diverging or converging: the differences between the spatial 'patches' are growing, while the aspatial differences are much smaller and declining.
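For each classification the plotted variance function is the usual random-slopes quadratic in the linear trend variable t (here the orthogonal linear score):

var = σ²u0 + 2σu01 t + σ²u1 t²

evaluated from the level 2 (unstructured, c31) and level 3 (spatially structured, c32) estimates respectively.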
However, we have to be careful not to over-interpret these results when there is such weak evidence for the additional structured spatial effects. Consequently we will return to the MCMC unstructured model with varying slopes for the linear trend over areas, which is a substantial improvement over the simpler models.
Using this model we can estimate and plot the between-area variance function with 95% CIs, and the predictions for each place. There do appear to be genuine differences between counties, but there is not strong evidence that these are increasing over time. The predictions for each county can be obtained on the log scale, turned into relative risks by exponentiation,
calc c41 = expo(c41)
and plotted against Year
There are therefore half a dozen or so counties that are potential hotspots with a rising
relative risk. This model could then be used as a basis for more focussed hypothesis testing
by including distance from known pollution sources in the model.
An appreciation of what we have achieved can be had by comparing the raw, unmodelled relative risks with the modelled relative risks. We have highlighted two counties in all three graphs plotted below. County 80, coloured blue, has an SMR of 0.31 in 1988, based on an observed-to-expected ratio of 6 to 19 deaths; it is modelled to be consistently low. County 41, coloured red, has an SMR of 1.57 in 1988, derived from an observed-to-expected ratio of 77 to 49; it is consistently high and rising. The raw data has some extreme peaks where the SMR exceeds 2.25. The modelled data for this county (34) is shrunk through precision-weighted estimation towards the general trend, as the rate is based on an expected count of only 9 or so deaths. With small expected counts we can expect non-substantive high relative risks just by chance.
Self-rated health of the elderly in China155
This short section is meant to be illustrative of what can be achieved, not to provide a detailed account of the modelling process. The distinctive feature of this study is that individual data are used in the model, not already-aggregated data. It is based on survey data from the Chinese Longitudinal Healthy Longevity Survey, a sample of over 13000 elders aged over 60 in 22 provinces of China in 2002. The response is self-rated health, a subjective assessment of one's own health status. We are going to model this in binary form: in Good health as opposed to Poor/Fair health. The data have a three-level hierarchical structure, with individuals nested within provinces nested in 'patches' of provinces. A null or empty binomial logit model was specified, and 100,000 MCMC simulations resulted in the following estimates.
There is a moderately large spatially structured variance of 0.343 on the logit scale which, using Larsen's MOR procedure (Chapter 12), translates into an odds of 1.75 for the average effect of conceptually re-locating a person from a low to a high patch. The 95% CIs for the odds ratio, using the MCMC 2.5% lowest and highest values, however show a lot of uncertainty about the size of this patch effect, as the MOR credible intervals range from 1.23 to 2.57. Equivalent values for the unstructured variance of 0.038 gave an MOR of 1.19 and 95% credible intervals of 1.09 and 1.35, which are clearly smaller. The plot below shows the estimated odds for each province in the null model, and the bigger effects of the spatially structured effects are clearly visible. An interesting case revealed by this plot is Henan province: the Henan patch is below the all-China average of 1, but the aspatial unstructured area effects for Henan are clearly well above 1. The core area of the patch, Henan itself, has better health for the elderly than its neighbours.
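Larsen's MOR is expo(sqrt(2σ²u) x 0.6745), where 0.6745 is the 75th percentile of the standard Normal distribution. A quick check in the Command interface (a sketch, writing the square root as a power of 0.5 and putting the results in boxes b1 and b2):

calc b1 = expo(((2*0.343)^0.5)*0.6745)
calc b2 = expo(((2*0.038)^0.5)*0.6745)

b1 returns roughly 1.75 and b2 roughly 1.2, in line with the point estimates quoted above.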
155 We thank Zhixin Feng for his research reported in this section.
A more complex model was fitted in which a large number of individual and family circumstances were included. The spatially structured effects remained moderate in size, but there was a large standard error, so they must be treated with caution. Here are the plots of the spatially structured and unstructured residuals.
It is noticeable that, while the effects are attenuated once account is taken of the differential composition of the provinces in terms of individuals, the Henan 'patch' still generally has low health, while Henan itself is even more of an outlier. There is something that makes those sampled in Henan report better health than their provincial neighbours, even when account is taken of individual age, sex, education, income, source of finance, who lives with them, and whether they think they have sufficient finance. Here are the maps of the spatially structured residuals for the null and the complex model.
Null model: spatially structured province
effects
Full model: spatially structured province
effects
While care must be taken in interpretation, as the quartile cut-offs differ between the two maps, essentially the same map patterns are found.
What we have learnt
 Multilevel models and software are capable of fitting models for spatial residual
dependence and heterogeneity and they do this through multiple membership
models.
 These models can be used with different types of response, including binary and
count data, and in combination with other structures such as repeated measures and
multivariate designs.
 The spatial lag model in contrast is not readily fitted.
 The weights matrix and what areas are defined as neighbours can crucially
determine the results but unfortunately in practice, not a lot of theory is available to
guide the choice; consequently they are often based merely on convenience.
 In spatial epidemiology and the mapping of risk, spatial models can be protective against finding false positives, owing to precision-weighted shrinkage. However, this goes somewhat against the precautionary principle of trying to avoid false negatives.156
 There is a danger that such models are applied when there are no genuine spatial processes at work; used in this exploratory fashion they run the risk of maximizing mis-specification, particularly when only aggregate data are involved.
156 Principle 15 of the 1992 Rio declaration: "In order to protect the environment, the precautionary approach shall be widely applied by States according to their capabilities. Where there are threats of serious or irreversible damage, lack of full scientific certainty shall not be used as a reason for postponing cost-effective measures to prevent environmental degradation."