Statistical Modeling

Transcription

Statistical Modeling
Estimating the
impact of
air pollution
using small
area
estimation
Male
chimpanzee
aggression towards
females:
a test of the sexual
coercion hypothesis
mine cetinkaya-rundel
statistical science
Estimating the impact of
air pollution
using small area estimation
data
91344
data
91340
91345
91331
91343
91335
91352
91504
91406
91501
91411 91401
91607
91502
9150591506
91201
91601
91107
91105
91205
91204
90027
90210
90046
90069
91702
91104
91206
91203
91602
91403 91423
91103
91101
91106
91108
90039 90065
90028
90038 90029
90031
90031
90026
90004
90012
90020
90024
90212
90005 90057
90017
90035
91754 91755
90033
90019
90006
90025
90013
90063
90015
90064
90402
90403
90034
90016
90007
90404
90022
90640
90232
90023
90640
90401 90405
90011
90640
90066
90037
90291
91767
9004890036
90292 90066
90601
90601
90602
90003
90044
90280 90280
90304
90262
90061
90250
90280
90222
90247
90278
90254
90277
90248
90504
90746
90502
90745
90745
90807
90806
90710
90744
90813 90813
90802
9073190802
90731
90802
0.4
0.4
0.4
0.2
Freeways
No data
Wheezing
Freeways
No data
1.0
Medication
useinfor
or asthma
Wheezing
thewheezing
last 12 months
1.0
eda
DoctorWheezing
diagnosed
infections
orear
asthma
1.0
1.0
0.8
0.8
0.8
0.6
0.6
0.6
0.6
0.4
0.4
0.4
0.4
0.2
0.2
0.2
0.2
Freeways
Freeways
No data
data
No
0.0
Doctor diagnosed asthma
1.0
1.0
Freeways
Freeways
No data
No data
Medication use for wheezing or asthma
Figure 1.3: Distribution of observed rates of outcomes of interest
Freeways
No data
0.0
0.0
0.8
0.0
Wheezing or asthma
0.0
Freeways
Freeways
No data
No data
0.2
0.2
0.0
0.0
1.0
0.8
0.8
0.8
0.6
0.610
0.6
0.4
0.4
0.4
0.2
0.2
0.2
0.0
Freeways
No data
0.0
Freeways
No data
0.0
30
40
0.9
25
35
30
25
Freeways
Monitoring stations
No data
Freeways
Freeways
Monitoringstations
stations
Monitoring
Nodata
data
No
20
Average NO, entire pregnancy (LUR, optimal), per 20 ppb
Freeways
Monitoring stations
No data
50
3
Average
PM2.5,pregnancy
entire pregnancy,
per 10µg/m
Average
NO2, entire
(LUR, optimal),
per 10 ppb
45
23
40
30
22
35
21
30
25
20
25
19
20
20
18
15
15
17
Freeways
Monitoring stations
No data
10
Average O3, entire pregnancy, per 10 ppb (10am − 6pm)
24
35
60
eda
3 6pm)
Average
O3, entire
per 10 ppb
(10am −
Average
PM10,pregnancy,
entire pregnancy,
per 10µg/m
10
0.5
60
45
45
40
35
35
30
30
Freeways
Freeways
Monitoring stations
Monitoring stations
No data
No data
Average PM2.5, entire pregnancy, per 10µg/m3
1.4
1.3
1.2
50
15
0.6
50
40
Figure 1.4: Distribution of air pollution exposure estimates
55
20
0.7
55
16
Average CO, entire pregnancy, per 1 ppm
0.8
25
20
25
24
23
22
1.1
45
21
13
1.0
20
40
0.9
19
35
0.8
18
30
Freeways
Monitoring stations
No data
25
20
0.7
Freeways
Monitoring stations
No data
0.6
0.5
Freeways
Monitoring stations
No data
17
16
motivation - PH
Develop a model that makes use of information readily
available in public data bases instead of having to
collect data via costly follow-up surveys
motivation - STAT
rare outcomes + low sample
sizes from each area
↘︎
Wheezing
1.0
0.8
Wheezing in the last 12 months
fluctuations
highlighted
might be due to chance,
not real changes in the
underlying risk factors
0.6
↘︎
smoothing
0.4
0.2
Freeways
No data
0.0
Freeways
No data
= E i ✓i .
(4.4)
can be calculated as
riskarea
andassuming
can
as (4.5)
µ be
= constant
Ecalculated
✓ . risk and
Ei = ni r,
(4.4)
i
i 1996].
i
ributions
[Mollie,
Then,
yi |xi ⇠ P oisson(µ
Ei ini ),Equation (4.4) represent the (4.3)
expected number of cases
(4.5)
expected number of cases in each small Ei = ni r,
E
nthe
(4.5) area a
i =
i r,
where
n
is
sample
size
in
each
small
i
area
assuming
constant
risk in
andeach
can be
calculated as
E
n
(4.5)
represent
the
expected
number
of
cases
small
i =
i r,
nd
r
is
the
ratio
of
total
observed
e calculated
as ⇠ P oisson(µ ),
y
|x
(4.3)
i
i in each small area and r is the ratio
where ini is the sample size
of total observed
P
m small area and r is the ratio of total observed
where ni is the sample size successes
in each
over the total sample size, i.e.P the o
i=1 yi
risk
and
can
be
calculated
as
P
overall
success
rate,
r
=
.
y
P
m
E
=
n
r,
i
i
P
n
=
n
r,
(4.5)
over
the
totalrsample
the overall
success rate, r =y
.
sizei in successes
each small
area
is thei size,
ratioi.e.
of total
observed
µ =
E
✓ . and
(4.4)
i=1
spatial
hierarchical
model
g(✓(4.4)
)= X +u +v
i
i
successes over
thei total
sample size, i.e. the overall success rate, r =
Furthermore,
m
i=1 i
Pm
i=1 ni
.
Pm
i=1 yi
Furthermore,
P
l sample
i.e.
the areas.
overall
success
rate,
= then
.
Furthermore,
than
for
arbitrary
Therefore
if vsize
dominates,
the
estimated
risks
m
ni isobserved
the
sample
in reach
small
area
and
r
is the
nwhere
(4.5)
sample
size
in
each
small
area
all
area
and size,
r isEtwo
the
n
i =ratio
i r,of total
i=1 i
m
i=1 i
m
i=1 ni
ratio of t
) represent
the
expected
number
of cases in
each
small then the estimat
Pm
than
for
two
arbitrary
areas.
Therefore
if
v
dominates,
µ
=
E
✓
.
ia spatiali structure
i
g(✓
=iu+dominates
X
+ vuii++↵,
vi +the
↵, estimated
(4.6)
yi i) X
i risks will
i
i
i+
will
display
while
if
then
g(✓
)
=
u
(4.6)
i=1
i
i
P
+
↵,
(4.6)
.e.
the
overall
success
rate,
r
=
.
m
successes
over
the
total
sample
size,
i.e.
the
overall
success
rate
overall
success
rate
i
i=1 ni
risk
and
can
bea area
calculated
ze
in
each
small
andstructure
r as
is the ratio
ofif total
observedthen the estimated ri
shrink
towards
the overall
mean.
will
display
spatial
while
u
dominates
where
X
is Furthermore,
associated
with
measured
covariates
that
are known
or suspected
where
Xi X
isi associated
with
measured
covariates
that are
known
or suspected
g(✓
)
=
+
u
+
v
+
↵,
(4.6)
covariates
where
X
is
associated
with
measured
covaria
i
i
i
i
Peach
i
m
)tesrepresent
the
expected
number
of
cases
in
small
thatu and
arev known
or
suspected
yi
i=1
are
assumed
to
be
independent
where
u
follow
a
Normal
distribution
P
i specific
sample
size,
i.e.to the
overall
success
rate,
r
=
.
m error
+ ushrink
+to
vbe
↵,relevant
towards
mean.
ito
i+
relevant
the
disease,
v(4.6)
and
u
are
area
terms,
and ↵ and ↵
be
to the
disease,
v
and
u
are
area
specific
error
terms,
area-level
error
i
i
nu
i
i g(✓i ) = Xi=1
+
i i + vi + ↵,
i
Ei = ni r,
(4.5) v and u are a
2
to
be
relevant
to
the
disease,
iprior is
i
d risk
withspecific
measured
covariates
are
known
or suspected
with
unknown
variance,
ui ⇠ that
N
(0,
). An
uninformative
inverse-Gamma
and
can
be
calculated
as
area
error
terms,
and
↵
u
unit-level
error
is is
a random
error
termterm
is not
specific.
The link
g(✓i ), isg(✓i ), is
a random
error
that
is area
not area
specific.
Thefunction,
link function,
d covariates
that are
known
orthat
suspected
u and on
v are
assumed
be independent
where
ucovariates
athat
Normal
distr
Xi to
isGamma(0.001,
associated
with
measured
are known
i follow
2 where
2
assumed
,
⇡(
)
⇠
Inv
0.001).
v
arise
from
a
conditional
i and ↵
uu are
u
disease,
v
and
area
specific
error
terms,
is
a
random
error
term
that
is
not
area
spec
defined
as
log(✓
).
This
is
very
similar
to
the
Poisson
regression
model
described
i
i
i
size
in
each
small
area
and
r
is
the
ratio
of
total
observed
fic.
The
link
function,
g(✓
),
is
as log(✓
This
is↵,very
similar to the Poisson (4.6)
regression model described
ui g(✓
are defined
area
specific
error
terms,
and
i ↵
i ).
)
=
X
+
u
+
v
+
2
i
i
i be Normal
irelevant
to
to
the
disease,
v
and
u
are
area
specific
error
auto
regressive
(CAR)
process
where
i
i
with
unknown
variance,
u
⇠
N
(0,
).
An
uninformative
inverse-Gamma
E
=
n
r,
(4.5)
P
i
i
i
u
maccount for spatial
in
Equation
(3.3),
but
includes
area-level
error
terms
that
yi account for spatial
mlreasample
that
isEquation
not
area
specific.
The
link
function,
g(✓
),
is
i=1that
specific.
The
link
function,
g(✓
),
is
in
(3.3),
but
includes
area-level
error
terms
i
i
P
defined
as
log(✓
).
This
is
very
similar
to
the
Po
size, i.e. model
the overall
success
r
=
.
i ◆
m
✓ P rate,
oisson regression
described
2 is not area
i
i=1 nspecific.
v
is
a
random
error
term
that
The
linka func
2
2
j
j⇠i
v
with
measured
covariates
that
are
known
or
suspected
assumed
on
,
⇡(
)
⇠
Inv
Gamma(0.001,
0.001).
v
arise
from
con
variability/dependence.
i
,
.
(4.7)
u
u vdescribed
i |v i ⇠ N
to
the
Poisson
regression
model
variability/dependence.
ni model
ni described
is
very
similar
to the
Poisson
regression
size
in
each
small
area
and
r
is
the
ratio
of
total
observed
in
Equation
(3.3),
but
includes
area-level
erro
r terms that account
spatial
definedfor
as log(✓
).
This
is
very
similar
to
the
Poisson
regression
mo
i
conditional
auto-regressive
The
di↵erence
between
viNormal
and
ui is that
whileterms,
ui are unstructured,
isease,
v
and
u
are
area
specific
error
and
evel
error
terms
that
account
for
spatial
auto
regressive
(CAR)
process
where
i
i
2 ↵
2 vi represent
P
m
An uninformative
inverse-Gamma
prior
isaccount
also assumed
on
Invvi represent
The
di↵erence
between
vi and
ui is
that
while
are unstructured,
u , ⇡( u ) y⇠
includes
area-level
error
terms
that
forui spatial
i
i=1
●
●
●
●
●
●
●
●
● ●
●
●
●
●
● ●
●
●
● ●
●
●
●
●
●
●
●
●
●
● ●●
●
●●
●
●
●
●
●
●
● ● ●
●
●
● ●● ●
●
● ●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
● ●
● ●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
model fitting
WinBugs and R2WinBUGS
Gibbs sampler with four independent chains
100,000 iterations (20,000 burn-in iterations)
Uninformative priors assumed on all βs, π(β) ∼ N(0,1e5)
models
air pollution variables: NO, NO2, O3, and CO
M1: air pollution + race/ethnicity
no models with only air pollution exposure metrics since in public health literature it is
believed that there is bound to be some confounding due to demographic variables
M2: air pollution + race/ethnicity
+ education + prenatal pay
M3: air pollution + race/ethnicity
+ education + prenatal pay
+ average per capita income per small area
Wheezing
Observed/Expected
Model 1: Race only
TOO FLAT
ratio of
obs / exp
3.60
2.15
1.29
Model 2: M1 + edu + prenatal pay
BETTER
Model 3: M2 + imputed income
NO IMPROVEMENT
0.77
0.46
0.00
Doctor diagnosed ear infections, mean
Observed/Expected
NOT SO RARE!
OUTCOME
Model 1: Race only
SMOOTHER
1.97
1.38
0.96
Model 2: M1 + edu + prenatal pay
NO IMPROVEMENT
Model 3: M2 + imputed income
NO IMPROVEMENT
0.67
0.47
0.00
Male chimpanzee aggression
towards females:
a test of the
sexual coercion hypothesis
data
Estimating the impact of
air pollution
using small area estimation
data
cases: paired interactions (dyads) between male and
female chimpanzees
explanatory: !
aggression rates, copulation, offspring and paternity
demographic information on the chimpanzees
(female age, male rank in the population, etc.)
swollen period (exposure time)
Estimating
the
impact
of
response:!
number of copulations in the reproductive window
airachieved
pollution
whether the dyad
a paternity in the
conception window in question (0/1)
using small area estimation
er. The observed copulation counts are clearly overdispersed when
sson distribution. This is verified by the large discrepancy between
54 and the sample variance 44.726. This suggests that distributions
spersion of count data, such as quasi-Poisson or negative binomial,
iate for modeling these data than the obvious first choice of Poisson
copulation
counts
0.25
Histogram of the copulations counts
0.15
poisson
Estimating
the
impact
of
negative binomial
air
pollution
poisson
with random effect for individual
using small area estimation
0.05
0.10
quasi-possion
0.00
Density
0.20
empirical density
poisson density
0
10
20
30
copulation counts
40
50
copulation
counts
mod.pois = glmer(cop_count ~ Parity + Swollen_agg_Z + NonSwol_agg_Z + M_Rank_JTFMDS!
+ Rank_pd_Despot_ratio + Relatedness + fem_age_c
# main effects!
+ Parity*Swollen_agg_Z + Parity*NonSwol_agg_Z
# interactions!
+ Parity*M_Rank_JTFMDS + Parity*Relatedness!
+ Swollen_agg_Z*M_Rank_JTFMDS + Swollen_agg_Z*Relatedness!
+ Swollen_agg_Z*fem_age_c + NonSwol_agg_Z*M_Rank_JTFMDS!
+ NonSwol_agg_Z*Relatedness + NonSwol_agg_Z*fem_age_c!
+ M_Rank_JTFMDS*Rank_pd_Despot_ratio + M_Rank_JTFMDS*fem_age_c!
+ (1|FEMALE) # random effects by female!
+ (1|obs)
# random effect for obs ID as a fix for overdispersion!
+ offset(log(Num_int_swollen)),
# offset!
data = chimp, family = "poisson") # poisson model!
!
Estimating the impact of
air pollution
using small area estimation
all.mod.pois = dredge(mod.pois, rank="AIC", fixed=~offset(log(Num_int_swollen)),!
trace=T, m.min=1)
# model selection
paternity
mod.bin = lmer(Paternity ~ Parity + Swollen_agg_Z + NonSwol_agg_Z + M_Rank_JTFMDS!
+ Rank_pd_Despot_ratio + Relatedness + fem_age_c
# main effects!
+ Parity*Swollen_agg_Z + Parity*NonSwol_agg_Z
# interactions!
+ Parity*M_Rank_JTFMDS + Parity*Relatedness!
+ Swollen_agg_Z*M_Rank_JTFMDS + Swollen_agg_Z*Relatedness!
+ Swollen_agg_Z*fem_age_c + NonSwol_agg_Z*M_Rank_JTFMDS!
+ NonSwol_agg_Z*Relatedness + NonSwol_agg_Z*fem_age_c!
+ M_Rank_JTFMDS*Rank_pd_Despot_ratio + M_Rank_JTFMDS*fem_age_c!
+ (1|FEMALE),
# random effects by female!
data = chimp, family = binomial)
# logistic model!
all.mod.bin = dredge(mod.bin, rank="AIC", trace=T, m.min=1)
# model selection
Estimating the impact of
air pollution
using small area estimation
2011 (UCLA): LAPD 911 calls
2012 (Duke): kiva.com
2013 (Duke + UNC + NCSU): eHarmony
2014 (Duke + UNC + NCSU): GridPoint
data
cases: 1M matches generated by the eHarmony algorithm
success: user (female) or candidate (male) sends
message within 7 days of match
variables:!
demographics
distance between user and candidate (zip code centroid)
habits
preferences in partner
how relaxed on preferences
…
Datacruncherz
Datacruncherz
data
cases: hourly energy usage records from >100 buildings
over 3 years
variables:!
total energy usage (main load)
energy usage by component (HVAC, kitchen, lights,
fridge, etc.)
building data: type of building (restaurant, quick serve,
retail), state, square feet
date/time
…
Model Specification
Model:
Main load for observation i at building j given by yij , i 2 {1, . . . , 36}, j 2 {1, . . . , 75}.
yij ⇠ N(µij ,
2
µij = ↵ij + xi|
j
)
j
⇠ N(µ , diag(w12 , . . . , wp2 ))
µ ⇠ N(0, diag(
2
,...,
1
2
p
))
Non-informative hyper priors on all other parameters
Ga(.01, .01) on precisions or Unif (0, 100) on standard deviations
N(0, 1000) on mean parameters
CougarBait
Team: Cougar Bait
Data Fest 2014 Presentation
Prediction of Main Load Usage
Predicted and Forecast Values, Retail, Building 70
160
75
Predicted and Forecast Values, Rest. Quick Serve, Building 1
150
Actual Data
Fitted Values
Future Predicted Values
95% Intervals
140
120
130
Main Load (kWh)
60
55
40
100
45
110
50
Main Load (kWh)
65
70
Actual Data
Fitted Values
Future Predicted Values
95% Intervals
0
10
20
30
Month
40
0
10
20
30
40
Month
Figure : Model performance on a Quick Serve Restaurant (Building 1) and Retail Store (Building 70), showing comparisons betw
actual values, fitted values, future prediction of Main Load usage for the year 2014, and their 95% CI
CougarBait