Workshop on Complex Sample Survey Design • Introduction
Transcription
Workshop on Complex Sample Survey Design • Introduction
Workshop on Complex Sample Survey Design Session 3: Using the Survey Library • • Introduction Describing the Sample to R: (1) SRS (2) A Stratified Sample (3) A Clustered Sample • Estimating Transport Accident Deaths (1) SRS (2) A Stratified Sample (3) A Clustered Sample (4) A Clustered Stratified Sample • • Estimating Stroke Deaths Summary Introduction In Session 2 we used Monte Carlo simulation, starting with a population with known outcomes, and taking repeated samples using simple, stratified, and clustered random samples. In this session, we’ll learn how to use R’s survey library to create a survey design object for data collected in a survey. This library uses statistical theory to obtain information about the population without the need to do any simulations to compute the standard error. We will also focus on a particular nationwide survey that was undertaken in Thailand, the 2005 Verbal Autopsy Study, with particular reference to determining the true burden of transport accident and stroke deaths in the various regions of Thailand. 2 Describing the Sample to R: (1) SRS 3 After entering a version of R in which the survey library has been installed, this library is loaded using the command library(survey) After specifying the folder containing your data files, you can read the srs1.txt data file created in Session 2 using the command srs1 <- read.table("srs1.txt", h=TRUE) To describe this data table as a SRS sample, the first step is to specify the finite population correction (fpc), which takes account of the reduced error due to the finiteness of the population. This is achieved simply by inserting a variable in the table containing the population size, as follows. srs1$fpc <- 970 To create a design object named srs1.design (say), type srs1.design <- svydesign(id=~1, fpc=~fpc, data=srs1) The argument id=~1 says that subjects are being sampled as 4 individuals, and not in stages (such as in cluster sampling). For basic identifying information for the design, just type its name. To estimate numbers of outcomes in each cause group (defined as the variable vrGrp), use the function svytotal() by typing This estimate of total perinatal deaths agrees largely with that given by the formula on Slide 2.8. Recall that the number of perinatal deaths in the population is 358, so the estimate is 15% too low. R commands are in ex31.Rcm. ............(2) A Stratified Sample 5 These commands (in ex32.Rcm) create a survey object from the sample strats1.txt stratified by location (see Slide 2.11). Note that it requires knowing the numbers in each stratum in the population 1 (482 & 488 for in and outside hospital, respectively). strats1 <- read.table("strats1.txt", h=TRUE) strats1$fpc <- ifelse(strats1$ghos==0, 488, 482) strats1.design <- svydesign(id=~1, strata=~ghos, fpc=~fpc, data=strats1) To estimate numbers of outcomes in each cause group, type svytotal(~vrGrp, design=strats1.design) and get results as follows. In contrast to the estimate given by the SRS, which was 15% too low, this estimate of the number of perinatal deaths in the population is too high by 12%. ______________________________________________________________________________________________________ 1 The R command table(p12$ghos) gives these counts. ......(3) A Clustered Sample 6 These commands (in ex33.Rcm) create a survey object from the sample clusts1.txt (see Slide 2.17). Read sample clusts1 <- read.table("clusts1.txt", header=TRUE) data At each stage you need to tell R how many units (primary, secondary, etc) were in the population sampled from, so the function can calculate the probability weights. In this case fpc1 is 16 (number of super-districts in the population sampled from) and fpc2 is the number of individuals in each selected super-district in the population. FPC <- as.data.frame(table(pop$sDistID)) Create names(FPC) <- c("sDistID","fpc2") design clusts1 <- merge(clusts1, FPC, by="sDistID") variables clusts1$fpc1 <- length(unique(p12$sDistID)) clusts1$ID <- 1:nrow(clusts1) Define design clusts1.design <- svydesign(id=~sDistID+ID, fpc=~fpc1+fpc2, data=clusts1) object Results To estimate numbers of outcomes in each cause group, type svytotal(~vrGrp, design=clusts1.design) The result is This estimate of total perinatal deaths in the population is too low by 37%, but the 95% confidence interval contains the true value (358). 7 Estimating Transport Accident Deaths 8 For a more realistic application, we now consider the problem of estimating the true number of deaths from transport accidents in Thailand, using the Vital Registration (VR) database on reported deaths in 2005. Since cause of death patterns are quite different for children under 5 and older persons, we’ll focus on subjects aged 5 or more years. This population of 384,689 contains 10,761 reported transport accident deaths), but there are many more due to misreporting. Using data from the VA study, Nuntaporn Klinjun (2013) 2 fitted a logistic regression model that gives the total as 21,716, and created a data table containing the probability that a transport accident was the true cause for each record. We’ll use the subset of these data for PHA 12 (N = 24,084) to illustrate various sample designs, with sample sizes 824, the same as was used in the Verbal Autopsy study. __________________________________________________________________________________________________________________-________________ 2 http://www.sat.psu.ac.th/mcs/seminar/slides/parnSlides.pdf 9 The VA sample for this region contains subjects aged 5+ (59 in Krasae Sin district and 765 in Hat Yai district), among whom 43 were assessed to have died from transport accidents. When this count is scaled up by the reciprocal of the sampling fraction n/N (824/24084), it gives the estimate 1256.7 for the total number of transport accident deaths in PHA 12 in 2005, which is quite close to the value 1248.2 obtained by Klinjun (2013) based on her analysis of the VA study data. (This value is obtained simply by summing the estimated probabilities that each subject died due to a transport accident, for all subjects in the population.) If the outcomes were independent, the formula for the standard error of a proportion estimated from a finite population (Slide 2.8), might be used to obtain a 95% confidence interval for this estimate 3. It is 890.1 to 1623.6. ___________________________________________________________ 3 Use R to check this result. ............(1) from a Simple Random Sample 10 Get a SRS (n = 824) from the PHA 12 population, and then use the survey package to estimate the total (see ex34.Rcm). p12 <- read.table("ta12pop.txt", h=TRUE, as.is=TRUE) N <- nrow(p12) # population size iseed <- 410701; set.seed(iseed) n <- 824 # sample size s12 <- p12[sample(N, size=n),] # create sample library(survey) # load survey library s12$fpc <- N s12.design <- svydesign(id=~1, fpc=~fpc, data=s12) ta12tot <- svytotal(~p.transAcc, design=s12.design) This estimate is close to the 1256.7 from scaling up the VA sample total. The 95% confidence interval is much narrower than that given by the formula for the standard error of a proportion. ............(2) from a Stratified Random Sample 11 Can we improve this estimate by stratifying on known variables? In her analysis of the VA study Klinjun found the reported cause to be a strong predictor of transport accident cause of death. ROC curve Prob=0.333 To simplify, use just 3 levels of reported ICD-10 code: (1) transport accident (V), (2) other injury (W or X00-X59), and (3) other cause of death. We’ll use this variable to stratify the sample to estimate the true number of transport accident deaths. The VA survey used simple random sampling from districts, percentages in these three groups being 5.6, 3.4 and 91.0, respectively, so we’ll use sample sizes of 40 in the two smaller strata to match these percentages, leaving 744 in the remaining stratum. The following R commands (continued on next slide) create a stratified random sample and define the corresponding design object (see ex35.Rcm). 12 p12 <- read.table("ta12pop.txt", h=TRUE, as.is=TRUE) str(p12) # list variable names and data types rc <- p12$ICD10 # reported ICD10-code ch <- substr(rc, 1, 1) # its chapter k <- as.integer(substr(rc, 2, 3)) # and integer p12$strat <- ifelse(ch=="V", "1:transAcc", ifelse(ch=="W" | (ch=="X" & k<60), "2:othInj", "3:other")) n <- c(40, 40, 744) # sample sizes in strata p12.1 <- subset(p12, strat=="1:transAcc") p12.2 <- subset(p12, strat=="2:othInj") # define strata p12.3 <- subset(p12, strat=="3:other") # in population 13 iseed <- 410701 set.seed(iseed) ss12.1 <- p12[sample(nrow(p12.1), size=n[1]),] # sample ss12.2 <- p12[sample(nrow(p12.2), size=n[2]),] # from ss12.3 <- p12[sample(nrow(p12.3), size=n[3]),] # strata ss12 <- rbind(ss12.1, ss12.2, ss12.3) # combine samples library(survey) # load survey library N <- table(p12$strat) # strata sizes in population ss12$fpc <- ifelse(ss12$strat=="1:transAcc", N[1], ifelse(ss12$strat=="2:othInj", N[2], N[3])) ss12.design <- svydesign(id=~1, fpc=~fpc, strata=~strat, data=ss12) ta12tot <- svytotal(~p.transAcc, design=ss12.design) This estimate is now lower than the population value (1248.2). The confidence interval is very much shorter, and still includes the population value, so the stratified sample has done very well. ............(3) from a Clustered Sample 14 Use a function to create a clustered random sample with up to round(n/k) (where n = 824) subjects in each of k = 3 randomly selected super-districts of PHA 12, and then define the design object for this cluster sample (see ex36.Rcm). clusters <- function(data, k, n) { clus <- subset(p12, sDistID %in% sample(unique(sDistID), size=k)) samp <- subset(clus, ID %in% unlist(tapply(ID, sDistID, FUN=function(x) x[sample.int(length(x), size=min(length(x), n))]))) samp$fpc1 <- length(unique(p12$sDistID)) FPC <- as.data.frame(table(p12$sDistID)) names(FPC) <- c("sDistID","fpc2") samp <- merge(samp, FPC, by="sDistID") clus.design2 <- svydesign(id=~sDistID+ID, fpc=~fpc1+fpc2, data=samp) svyTotal <- svytotal(~p.transAcc, design=clus.design2) c(total=coef(svyTotal), se=SE(svyTotal)) } Results 15 The estimate is 24% higher than the true population value (1248.2), but still within a 95% confidence interval. The standard error is 33% higher than the value 142.82 from simple random sampling with the same total sample size. To simulate this sampling process 500 times, use the commands nit <- 500; k <- 4; n <- 200; iseed <- 410701 set.seed(iseed) nDeaths <- replicate(nit, clusters(p12, k, n)) apply(nDeaths, 1, FUN=mean) The result shows that the mean is quite close to the true mean, but the clustering has increased the standard error from the SRS. ............(4) from a Clustered Stratified Sample 16 Given that precision is reduced by only 25% by clustering into 3 of the 16 super-districts, and stratifying on three cause of death groups reduces the standard error from 142.8 (for the SRS) to 63.1, it seems desirable to combine these two methods. The following program (ex37.Rcm) does this. p12 <- read.table("ta12pop.txt", h=TRUE, as.is=TRUE) sD <- read.table("superDistricts.txt", header=TRUE) p12 <- merge(p12, sD, by="distID") rc <- p12$ICD10 ch <- substr(rc, 1, 1); k <- as.integer(substr(rc, 2, 3)) p12$strat <- ifelse(ch=="V","1:transAcc", ifelse(ch=="W" | (ch=="X" & k<60), "2:othInj", "3:other")) unique.sdists <- unique(pha12$sDistID) iseed <- 410701 set.seed(iseed) sampled.IDs <- sort(sample(unique.sdists, 3)) # randomly select 3 super-districts stage1 <- subset(pha12, sDistID %in% sampled.IDs) stage1$fpc1 <- length(unique.sdists) # fpc for first stage 17 n <- c(40,40,744) # total sample sizes in strata n.each <- round(n/3) # allocated to each cluster available <- table(stage1$strat, stage1$sDistID) stage1$stage1rows <- 1:nrow(stage1) # now take samples from all cluster.stratum combinations i11 <- sample(subset(stage1, sDistID==sampled.IDs[1] & strat=="1:transAcc")$stage1rows,n.each[1]) i12 <- sample(subset(stage1, sDistID==sampled.IDs[1] & strat=="2:othInj")$stage1rows,n.each[2]) i13 <- sample(subset(stage1, sDistID==sampled.IDs[1] & strat=="3:other")$stage1rows,n.each[3]) i21 <- sample(subset(stage1, sDistID==sampled.IDs[2] & strat=="1:transAcc")$stage1rows,n.each[1]) i22 <- sample(subset(stage1, sDistID==sampled.IDs[2] & strat=="2:othInj")$stage1rows,n.each[2]) i23 <- sample(subset(stage1, sDistID==sampled.IDs[2] & strat=="3:other")$stage1rows,n.each[3]) i31 <- sample(subset(stage1, sDistID==sampled.IDs[3] & strat=="1:transAcc")$stage1rows,n.each[1]) i32 <- sample(subset(stage1, sDistID==sampled.IDs[3] & strat=="2:othInj")$stage1rows,n.each[2]) i33 <- sample(subset(stage1, sDistID==sampled.IDs[3] & strat=="3:other")$stage1rows,n.each[3]) # and combine them to get the sample at the second stage 18 stage2.rows <- c(i11,i12,i13,i21,i22,i23,i31,i32,i33) stage2.data <- stage1[stage2.rows,] # Define the stage 2 design variables stage2.data$no.strata <- 1 stage2.data$fpc2 <- rep(as.vector(available), rep(n.each,3)) The clustered-stratified sample is now ready to be processed by the survey library. Use the following commands to do this. library(survey) design <- svydesign(id=~sDistID+stage1rows, strata=~no.strata+strat, fpc=~fpc1+fpc2, data=stage2.data) svytotal(~p.transAcc, design) But the standard error is larger than for the simple cluster design, so stratifying seems to give no benefit at all! Why? We’ll discuss this later. For now, let’s look at another example. Estimating Stroke Deaths 19 Stroke (ICD-10 code I60-69) is the main cause of death for Thai people. The doughnut plots below compare age-adjusted deaths by gender and 21 cause groups (defined in Nattakit Pipatjaturon’s analysis 3 of the VA data) for Thailand in 2005 & Japan in 2006. Areas of inner circles denote numbers of deaths in the region with lower mortality, and outer rings denote corresponding deaths in the region with higher mortality (Thailand in red, Japan blue). Suicide & stomach cancer are lower in Thailand, but most other causes (transport accident, HIV & endocrine disease) are higher. __________________________________________________________________________________________________________________-________________ 3 http://www.sat.psu.ac.th/mcs/seminar/slides/nattakitSlides.pdf 20 Pipatjaturon’s analysis involved fitting a logistic regression model to the risk of stroke death P, expressed as log(P/(1-P)) = const + factor(province) + factor(gender-age group) + factor(VR cause-location group) Results show higher … and if the reported cause is stroke the risk of stroke death in the Bangkok area. true cause is very likely to be stroke. We’ll use data from Bangkok (PHA 13) to illustrate estimation of stroke deaths. This region reported N = 28,850 deaths aged 5+ in 2005. The VA sample for Bangkok had n = 857 aged 5+ (391, 216 & 250 in districts Nongkam, Ratavee and Kannayaw, 1043: 39/250 1037: 37/216 respectively) among whom 137 (61, 37 & 39) were assessed as stroke deaths. Sample proportions are shown. 1023: 21 61/391 Using the formula on Slide 2.18, the estimate for the total number of stroke deaths in PHA 13 in 2005 is 4627.8. This number is 1.2% lower than the value 4686.2 estimated by Pipatjaturon. Using the survey package we’ll start with a simple random sample of size 857 from the population for PHA 13 (ex38.Rcm). Results This estimate is lower than that obtained by scaling up the sample total to the population total. The 95% confidence interval has a margin for error around 400 on each side of the estimate. Now let’s see if stratifying can improve its accuracy. We’ll run the R commands in ex39.Rcm. 22 Just two strata of size 50 (stroke), 807 (other reported cause) Results The estimated total is closer to the population total (4686.2) …and the confidence interval is halved. 23 Summary 24 In this session we learnt how to create survey design objects using the functions svydesign & svytotal in R’s survey library. This procedure requires additional variables to be created in the data table comprising the sample, including an ID field, one or more fields (depending on whether the design is single stage or multi-stage comprising finite population correction (fpc) factors, and sampling unit definitions. Using three different outcome variables (one binary and the others measured), we applied this method to estimate interval estimates of total counts for numbers of deaths from various causes (reported perinatal and VA-assessed transport accident and stroke) in two PHAs in Thailand. For the two examples involving VA-assessed outcomes, we found that stratification by reported cause improved the accuracy substantially, but there did not appear to be any benefit in using stratification for a cluster design with the transport accidents. Computer Exercises 25 1. Use Notepad to edit the R command file ex31.Rcm in your directory d:/IThai so that iseed is your personal ID. Open the R command window and copy & paste the contents of ex31.Rcm into this window. 2. Repeat this exercise for ex32.Rcm and write your results for vrGrp1:peri here: total_______ SE________ 3. Repeat the above exercise for ex33.Rcm and write your results here: total_______ SE________ 4. Repeat the above exercise for ex34.Rcm and write your results for ta12Tot here: total_______ SE________ 5. Repeat the above exercise for ex35.Rcm and write your results for ta12Tot here: total_______ SE________ 26 6. Repeat the exercise for ex36.Rcm & write your results below: total.p.transAcc________ SE.p.transAcc________ 7. Repeat the exercise for ex37.Rcm and write your results here: total_______ SE________ 8. Repeat the exercise for ex38.Rcm and write your results for p.stroke here: total_______ SE________ 9. Repeat the above exercise after randomly shuffling outcomes and write your results for p.stroke here: total_______ SE________