Statistical Modeling
Transcription
Statistical Modeling
Estimating the impact of air pollution using small area estimation Male chimpanzee aggression towards females: a test of the sexual coercion hypothesis mine cetinkaya-rundel statistical science Estimating the impact of air pollution using small area estimation data 91344 data 91340 91345 91331 91343 91335 91352 91504 91406 91501 91411 91401 91607 91502 9150591506 91201 91601 91107 91105 91205 91204 90027 90210 90046 90069 91702 91104 91206 91203 91602 91403 91423 91103 91101 91106 91108 90039 90065 90028 90038 90029 90031 90031 90026 90004 90012 90020 90024 90212 90005 90057 90017 90035 91754 91755 90033 90019 90006 90025 90013 90063 90015 90064 90402 90403 90034 90016 90007 90404 90022 90640 90232 90023 90640 90401 90405 90011 90640 90066 90037 90291 91767 9004890036 90292 90066 90601 90601 90602 90003 90044 90280 90280 90304 90262 90061 90250 90280 90222 90247 90278 90254 90277 90248 90504 90746 90502 90745 90745 90807 90806 90710 90744 90813 90813 90802 9073190802 90731 90802 0.4 0.4 0.4 0.2 Freeways No data Wheezing Freeways No data 1.0 Medication useinfor or asthma Wheezing thewheezing last 12 months 1.0 eda DoctorWheezing diagnosed infections orear asthma 1.0 1.0 0.8 0.8 0.8 0.6 0.6 0.6 0.6 0.4 0.4 0.4 0.4 0.2 0.2 0.2 0.2 Freeways Freeways No data data No 0.0 Doctor diagnosed asthma 1.0 1.0 Freeways Freeways No data No data Medication use for wheezing or asthma Figure 1.3: Distribution of observed rates of outcomes of interest Freeways No data 0.0 0.0 0.8 0.0 Wheezing or asthma 0.0 Freeways Freeways No data No data 0.2 0.2 0.0 0.0 1.0 0.8 0.8 0.8 0.6 0.610 0.6 0.4 0.4 0.4 0.2 0.2 0.2 0.0 Freeways No data 0.0 Freeways No data 0.0 30 40 0.9 25 35 30 25 Freeways Monitoring stations No data Freeways Freeways Monitoringstations stations Monitoring Nodata data No 20 Average NO, entire pregnancy (LUR, optimal), per 20 ppb Freeways Monitoring stations No data 50 3 Average PM2.5,pregnancy entire pregnancy, per 10µg/m Average NO2, entire (LUR, optimal), per 10 ppb 45 23 40 30 22 35 21 30 25 20 25 19 20 20 18 15 15 17 Freeways Monitoring stations No data 10 Average O3, entire pregnancy, per 10 ppb (10am − 6pm) 24 35 60 eda 3 6pm) Average O3, entire per 10 ppb (10am − Average PM10,pregnancy, entire pregnancy, per 10µg/m 10 0.5 60 45 45 40 35 35 30 30 Freeways Freeways Monitoring stations Monitoring stations No data No data Average PM2.5, entire pregnancy, per 10µg/m3 1.4 1.3 1.2 50 15 0.6 50 40 Figure 1.4: Distribution of air pollution exposure estimates 55 20 0.7 55 16 Average CO, entire pregnancy, per 1 ppm 0.8 25 20 25 24 23 22 1.1 45 21 13 1.0 20 40 0.9 19 35 0.8 18 30 Freeways Monitoring stations No data 25 20 0.7 Freeways Monitoring stations No data 0.6 0.5 Freeways Monitoring stations No data 17 16 motivation - PH Develop a model that makes use of information readily available in public data bases instead of having to collect data via costly follow-up surveys motivation - STAT rare outcomes + low sample sizes from each area ↘︎ Wheezing 1.0 0.8 Wheezing in the last 12 months fluctuations highlighted might be due to chance, not real changes in the underlying risk factors 0.6 ↘︎ smoothing 0.4 0.2 Freeways No data 0.0 Freeways No data = E i ✓i . (4.4) can be calculated as riskarea andassuming can as (4.5) µ be = constant Ecalculated ✓ . risk and Ei = ni r, (4.4) i i 1996]. i ributions [Mollie, Then, yi |xi ⇠ P oisson(µ Ei ini ),Equation (4.4) represent the (4.3) expected number of cases (4.5) expected number of cases in each small Ei = ni r, E nthe (4.5) area a i = i r, where n is sample size in each small i area assuming constant risk in andeach can be calculated as E n (4.5) represent the expected number of cases small i = i r, nd r is the ratio of total observed e calculated as ⇠ P oisson(µ ), y |x (4.3) i i in each small area and r is the ratio where ini is the sample size of total observed P m small area and r is the ratio of total observed where ni is the sample size successes in each over the total sample size, i.e.P the o i=1 yi risk and can be calculated as P overall success rate, r = . y P m E = n r, i i P n = n r, (4.5) over the totalrsample the overall success rate, r =y . sizei in successes each small area is thei size, ratioi.e. of total observed µ = E ✓ . and (4.4) i=1 spatial hierarchical model g(✓(4.4) )= X +u +v i i successes over thei total sample size, i.e. the overall success rate, r = Furthermore, m i=1 i Pm i=1 ni . Pm i=1 yi Furthermore, P l sample i.e. the areas. overall success rate, = then . Furthermore, than for arbitrary Therefore if vsize dominates, the estimated risks m ni isobserved the sample in reach small area and r is the nwhere (4.5) sample size in each small area all area and size, r isEtwo the n i =ratio i r,of total i=1 i m i=1 i m i=1 ni ratio of t ) represent the expected number of cases in each small then the estimat Pm than for two arbitrary areas. Therefore if v dominates, µ = E ✓ . ia spatiali structure i g(✓ =iu+dominates X + vuii++↵, vi +the ↵, estimated (4.6) yi i) X i risks will i i i+ will display while if then g(✓ ) = u (4.6) i=1 i i P + ↵, (4.6) .e. the overall success rate, r = . m successes over the total sample size, i.e. the overall success rate overall success rate i i=1 ni risk and can bea area calculated ze in each small andstructure r as is the ratio ofif total observedthen the estimated ri shrink towards the overall mean. will display spatial while u dominates where X is Furthermore, associated with measured covariates that are known or suspected where Xi X isi associated with measured covariates that are known or suspected g(✓ ) = + u + v + ↵, (4.6) covariates where X is associated with measured covaria i i i i Peach i m )tesrepresent the expected number of cases in small thatu and arev known or suspected yi i=1 are assumed to be independent where u follow a Normal distribution P i specific sample size, i.e.to the overall success rate, r = . m error + ushrink +to vbe ↵,relevant towards mean. ito i+ relevant the disease, v(4.6) and u are area terms, and ↵ and ↵ be to the disease, v and u are area specific error terms, area-level error i i nu i i g(✓i ) = Xi=1 + i i + vi + ↵, i Ei = ni r, (4.5) v and u are a 2 to be relevant to the disease, iprior is i d risk withspecific measured covariates are known or suspected with unknown variance, ui ⇠ that N (0, ). An uninformative inverse-Gamma and can be calculated as area error terms, and ↵ u unit-level error is is a random error termterm is not specific. The link g(✓i ), isg(✓i ), is a random error that is area not area specific. Thefunction, link function, d covariates that are known orthat suspected u and on v are assumed be independent where ucovariates athat Normal distr Xi to isGamma(0.001, associated with measured are known i follow 2 where 2 assumed , ⇡( ) ⇠ Inv 0.001). v arise from a conditional i and ↵ uu are u disease, v and area specific error terms, is a random error term that is not area spec defined as log(✓ ). This is very similar to the Poisson regression model described i i i size in each small area and r is the ratio of total observed fic. The link function, g(✓ ), is as log(✓ This is↵,very similar to the Poisson (4.6) regression model described ui g(✓ are defined area specific error terms, and i ↵ i ). ) = X + u + v + 2 i i i be Normal irelevant to to the disease, v and u are area specific error auto regressive (CAR) process where i i with unknown variance, u ⇠ N (0, ). An uninformative inverse-Gamma E = n r, (4.5) P i i i u maccount for spatial in Equation (3.3), but includes area-level error terms that yi account for spatial mlreasample that isEquation not area specific. The link function, g(✓ ), is i=1that specific. The link function, g(✓ ), is in (3.3), but includes area-level error terms i i P defined as log(✓ ). This is very similar to the Po size, i.e. model the overall success r = . i ◆ m ✓ P rate, oisson regression described 2 is not area i i=1 nspecific. v is a random error term that The linka func 2 2 j j⇠i v with measured covariates that are known or suspected assumed on , ⇡( ) ⇠ Inv Gamma(0.001, 0.001). v arise from con variability/dependence. i , . (4.7) u u vdescribed i |v i ⇠ N to the Poisson regression model variability/dependence. ni model ni described is very similar to the Poisson regression size in each small area and r is the ratio of total observed in Equation (3.3), but includes area-level erro r terms that account spatial definedfor as log(✓ ). This is very similar to the Poisson regression mo i conditional auto-regressive The di↵erence between viNormal and ui is that whileterms, ui are unstructured, isease, v and u are area specific error and evel error terms that account for spatial auto regressive (CAR) process where i i 2 ↵ 2 vi represent P m An uninformative inverse-Gamma prior isaccount also assumed on Invvi represent The di↵erence between vi and ui is that while are unstructured, u , ⇡( u ) y⇠ includes area-level error terms that forui spatial i i=1 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● model fitting WinBugs and R2WinBUGS Gibbs sampler with four independent chains 100,000 iterations (20,000 burn-in iterations) Uninformative priors assumed on all βs, π(β) ∼ N(0,1e5) models air pollution variables: NO, NO2, O3, and CO M1: air pollution + race/ethnicity no models with only air pollution exposure metrics since in public health literature it is believed that there is bound to be some confounding due to demographic variables M2: air pollution + race/ethnicity + education + prenatal pay M3: air pollution + race/ethnicity + education + prenatal pay + average per capita income per small area Wheezing Observed/Expected Model 1: Race only TOO FLAT ratio of obs / exp 3.60 2.15 1.29 Model 2: M1 + edu + prenatal pay BETTER Model 3: M2 + imputed income NO IMPROVEMENT 0.77 0.46 0.00 Doctor diagnosed ear infections, mean Observed/Expected NOT SO RARE! OUTCOME Model 1: Race only SMOOTHER 1.97 1.38 0.96 Model 2: M1 + edu + prenatal pay NO IMPROVEMENT Model 3: M2 + imputed income NO IMPROVEMENT 0.67 0.47 0.00 Male chimpanzee aggression towards females: a test of the sexual coercion hypothesis data Estimating the impact of air pollution using small area estimation data cases: paired interactions (dyads) between male and female chimpanzees explanatory: ! aggression rates, copulation, offspring and paternity demographic information on the chimpanzees (female age, male rank in the population, etc.) swollen period (exposure time) Estimating the impact of response:! number of copulations in the reproductive window airachieved pollution whether the dyad a paternity in the conception window in question (0/1) using small area estimation er. The observed copulation counts are clearly overdispersed when sson distribution. This is verified by the large discrepancy between 54 and the sample variance 44.726. This suggests that distributions spersion of count data, such as quasi-Poisson or negative binomial, iate for modeling these data than the obvious first choice of Poisson copulation counts 0.25 Histogram of the copulations counts 0.15 poisson Estimating the impact of negative binomial air pollution poisson with random effect for individual using small area estimation 0.05 0.10 quasi-possion 0.00 Density 0.20 empirical density poisson density 0 10 20 30 copulation counts 40 50 copulation counts mod.pois = glmer(cop_count ~ Parity + Swollen_agg_Z + NonSwol_agg_Z + M_Rank_JTFMDS! + Rank_pd_Despot_ratio + Relatedness + fem_age_c # main effects! + Parity*Swollen_agg_Z + Parity*NonSwol_agg_Z # interactions! + Parity*M_Rank_JTFMDS + Parity*Relatedness! + Swollen_agg_Z*M_Rank_JTFMDS + Swollen_agg_Z*Relatedness! + Swollen_agg_Z*fem_age_c + NonSwol_agg_Z*M_Rank_JTFMDS! + NonSwol_agg_Z*Relatedness + NonSwol_agg_Z*fem_age_c! + M_Rank_JTFMDS*Rank_pd_Despot_ratio + M_Rank_JTFMDS*fem_age_c! + (1|FEMALE) # random effects by female! + (1|obs) # random effect for obs ID as a fix for overdispersion! + offset(log(Num_int_swollen)), # offset! data = chimp, family = "poisson") # poisson model! ! Estimating the impact of air pollution using small area estimation all.mod.pois = dredge(mod.pois, rank="AIC", fixed=~offset(log(Num_int_swollen)),! trace=T, m.min=1) # model selection paternity mod.bin = lmer(Paternity ~ Parity + Swollen_agg_Z + NonSwol_agg_Z + M_Rank_JTFMDS! + Rank_pd_Despot_ratio + Relatedness + fem_age_c # main effects! + Parity*Swollen_agg_Z + Parity*NonSwol_agg_Z # interactions! + Parity*M_Rank_JTFMDS + Parity*Relatedness! + Swollen_agg_Z*M_Rank_JTFMDS + Swollen_agg_Z*Relatedness! + Swollen_agg_Z*fem_age_c + NonSwol_agg_Z*M_Rank_JTFMDS! + NonSwol_agg_Z*Relatedness + NonSwol_agg_Z*fem_age_c! + M_Rank_JTFMDS*Rank_pd_Despot_ratio + M_Rank_JTFMDS*fem_age_c! + (1|FEMALE), # random effects by female! data = chimp, family = binomial) # logistic model! all.mod.bin = dredge(mod.bin, rank="AIC", trace=T, m.min=1) # model selection Estimating the impact of air pollution using small area estimation 2011 (UCLA): LAPD 911 calls 2012 (Duke): kiva.com 2013 (Duke + UNC + NCSU): eHarmony 2014 (Duke + UNC + NCSU): GridPoint data cases: 1M matches generated by the eHarmony algorithm success: user (female) or candidate (male) sends message within 7 days of match variables:! demographics distance between user and candidate (zip code centroid) habits preferences in partner how relaxed on preferences … Datacruncherz Datacruncherz data cases: hourly energy usage records from >100 buildings over 3 years variables:! total energy usage (main load) energy usage by component (HVAC, kitchen, lights, fridge, etc.) building data: type of building (restaurant, quick serve, retail), state, square feet date/time … Model Specification Model: Main load for observation i at building j given by yij , i 2 {1, . . . , 36}, j 2 {1, . . . , 75}. yij ⇠ N(µij , 2 µij = ↵ij + xi| j ) j ⇠ N(µ , diag(w12 , . . . , wp2 )) µ ⇠ N(0, diag( 2 ,..., 1 2 p )) Non-informative hyper priors on all other parameters Ga(.01, .01) on precisions or Unif (0, 100) on standard deviations N(0, 1000) on mean parameters CougarBait Team: Cougar Bait Data Fest 2014 Presentation Prediction of Main Load Usage Predicted and Forecast Values, Retail, Building 70 160 75 Predicted and Forecast Values, Rest. Quick Serve, Building 1 150 Actual Data Fitted Values Future Predicted Values 95% Intervals 140 120 130 Main Load (kWh) 60 55 40 100 45 110 50 Main Load (kWh) 65 70 Actual Data Fitted Values Future Predicted Values 95% Intervals 0 10 20 30 Month 40 0 10 20 30 40 Month Figure : Model performance on a Quick Serve Restaurant (Building 1) and Retail Store (Building 70), showing comparisons betw actual values, fitted values, future prediction of Main Load usage for the year 2014, and their 95% CI CougarBait