Experiences of Resampling Approach in the Household Sample Surveys
Stefano Falorsi, Diego Moretti, Paolo Righi, Claudia Rinaldelli
ISTAT, via Adolfo Ravà 150, 00142 Roma, Italy
e-mail: stfalors, dimorett, parighi, rinaldel{@istat.it}

1. Introduction [1]

ISTAT calculates the sampling variance of common estimates (frequencies, means, totals) using the standard methodology [2] (Särndal et al., 1992; Wolter, 1985). Two software procedures [3] were developed in SAS to implement this methodology for complex sampling designs (Falorsi and Rinaldelli, 1998; ISTAT, 2005); developing independent, generalized software is common practice in many National Statistical Institutes.

In recent years ISTAT has paid special attention to the dissemination of complex statistics, that is, statistics expressed by non-linear functions. The standard methodology, and therefore the current software procedures for variance estimation, cannot be applied directly to these statistics (De Vitiis et al., 2003; Rinaldelli, 2006).

Following the main experience gained at ISTAT in using the Balanced Repeated Replications (BRR) technique to estimate the sampling errors of the Laeken indicators [4] (Moretti et al., 2005; Pauselli and Rinaldelli, 2003, 2004; Rinaldelli, 2005), this work reports some empirical comparisons of resampling techniques, namely the Delete-A-Group Jackknife (DAGJK), the Extended Delete-A-Group Jackknife (EDAGJK) and the Balanced Repeated Replications, in the context of the European Statistics on Income and Living Conditions survey (EU-SILC).

2. The Context: EU-SILC Survey and the Income Measures

The Statistics on Income and Living Conditions survey (EU-SILC survey) is aimed at producing estimates of income, living conditions and poverty; among these are the European relative poverty measures and inequality indicators (the Laeken indicators).
EU-SILC has been carried out by ISTAT under European Regulation since 2004 (Regulation, 2003) and is based on a two-stage sampling design (municipalities, households); each year about 60,000 people are involved in the survey, and the selected people are traced and surveyed for four consecutive years.

As mentioned in section 1, a main experience in evaluating the sampling variance of the complex Laeken indicators was carried out using the Balanced Repeated Replications technique. Here, in order to compare empirically the Delete-A-Group Jackknife and the Extended Delete-A-Group Jackknife methods with BRR and the linearization approach, the following income measures were considered: the mean equivalised disposable income of households by size (one, two, three, four and more people). The equivalised disposable income is obtained by dividing the total disposable income by the modified OECD scale, to take into account the different sizes and compositions of households. In this empirical study we used the above income measures, which are estimates of ratio parameters, instead of functions of quantiles because, in general, the Jackknife variance estimators of quantiles are not consistent.

[1] Claudia Rinaldelli wrote sections 1, 2, 5, 5.2 and 5.4; Diego Moretti wrote sections 4, 5.1 and 5.3; Paolo Righi and Stefano Falorsi wrote section 3.
[2] Standard methodology means a well-known and widely implemented methodology; the literature on sampling theory for finite populations provides formulas to calculate the sampling variance for the most used sampling designs and estimators.
[3] SGCE, Generalized System for Sampling Errors Estimation; the software GENESEES, GENEralised Software for Sampling Errors Estimation in Surveys, is available at www.istat.it.
[4] The European relative poverty measures and inequality indicators calculated using the data of the European Statistics on Income and Living Conditions survey (EU-SILC).

3. The Delete-A-Group Jackknife and Extended Delete-A-Group Jackknife methods

Usually, in large-scale surveys the goal is to estimate a parameter $\theta = f(y_1, \ldots, y_N)$ by using a sample statistic $\hat{\theta} = g(y_1, \ldots, y_n)$, where $y_i$ is the observed value of the variable of interest on the i-th unit, $N$ is the target population size and $n$ is the sample size. Complex sampling designs and non-linear estimators may be used. In this context, one standard approach for estimating the sampling error of $\hat{\theta}$ is to linearize the estimator by the Taylor series expansion method and then to apply the standard sample survey variance estimation methods. An alternative approach is to use replication techniques such as the Jackknife and Balanced Repeated Replication (BRR) methods.

Using the standard Jackknife in Official Statistics has some advantages: no model assumptions are needed; unit and item non-responses are easily dealt with; the variance of non-linear statistics and of domain estimates can be easily calculated by external users. Nevertheless, the Jackknife becomes computer intensive for large-scale surveys, because the estimation procedure has to be repeated for each sub-sample replicate.

In this section we describe the Delete-A-Group Jackknife (DAGJK) and Extended DAGJK (EDAGJK) methods. These methods belong to the strategies for using the Jackknife with a reduced number of replications while maintaining adequate precision of the variance estimate. DAGJK and EDAGJK are based on the following Jackknife procedure: the Primary Sampling Units (PSUs) in the same stratum are randomly ordered; from this ordering, the PSUs are systematically allocated into G groups; for the g-th group, the replicate g-weights of each elementary unit are computed.

Let us consider a weighted estimator of the form $\hat{\theta} = \sum_{i=1}^{n} w_i y_i$ of the total $\theta = \sum_{i=1}^{N} y_i$, where $w_i = d_i \gamma_i$ is the sample weight of unit i, $d_i$ being the base weight and $\gamma_i$ a correction factor for non-response or calibration.
The g-th ($g = 1, \ldots, G$) replicate estimate is $\hat{\theta}_{(g)} = \sum_{i=1}^{n} w_{i(g)} y_i$, where $w_{i(g)} = d_{i(g)} \gamma_{i(g)}$ is the g-th replicate weight of unit i. DAGJK and EDAGJK differ from each other in the expression of $d_{i(g)}$.

In the DAGJK method the replicate g-weights $d_{i(g)}$ are denoted by $d^{S}_{i(g)}$ and are given by

$$
d^{S}_{i(g)} =
\begin{cases}
d_i, & \text{when } i \in h \text{ and no PSU of stratum } h \text{ is included in group } g, \\
0, & \text{when } i \text{ belongs to a PSU included in group } g, \\
\left[ n_h / n_{h(g)} \right] d_i, & \text{otherwise,}
\end{cases}
\qquad (3.1)
$$

in which $n_h$ is the PSU sample size of stratum h and $n_{h(g)}$ is the PSU sample size of stratum h after excluding the PSUs of stratum h that belong to group g.

In EDAGJK, when $n_h < G$ the replicate g-weights $d_{i(g)}$, denoted by $d^{E}_{i(g)}$, are given by

$$
d^{E}_{i(g)} =
\begin{cases}
d_i, & \text{when } i \in h \text{ and no PSU of stratum } h \text{ is included in group } g, \\
d_i \left[ 1 - (n_h - 1) Z \right], & \text{when } i \text{ belongs to a PSU included in group } g, \\
d_i (1 + Z), & \text{otherwise,}
\end{cases}
\qquad (3.2)
$$

with $Z^2 = G / [(G - 1)\, n_h (n_h - 1)]$. When $n_h \ge G$, $d^{S}_{i(g)} = d^{E}_{i(g)}$.

The precision of the variance estimates depends on the values of G and $n_h$ ($h = 1, \ldots, H$). Kott (1998b) shows that $n_h = 5$ is the minimum value needed to obtain a relative bias of at most 20% when using the weights (3.1) for the Horvitz-Thompson estimator. This condition is rarely satisfied in socio-demographic surveys, which use highly clustered sampling designs and usually select a small, fixed number of PSUs in each stratum. Kott (2001) shows that, using the weights (3.2), the estimate of the variance of the Horvitz-Thompson estimator is unbiased even when $n_h < 5$.

As for the value of G, the literature does not appear to provide any reference on the appropriate number of groups. In any case, it is common practice to choose a value between 15 and 30 (Kott, 1998b; Rust, 1985), considering that when G is greater than 15 the Student's t distribution is approximated quite well by the normal distribution.

There are some alternative ways to compute the replicate correction factor $\gamma_{i(g)}$ (Kott, 1998b; 2006).
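The grouping step and the replicate base weights (3.1) and (3.2) can be illustrated with a minimal Python sketch. The function names `allocate_groups` and `replicate_base_weights` and the toy data are hypothetical, not part of the ISTAT software. The sketch assumes that every stratum has at least two sample PSUs, that the systematic allocation is a running counter over the randomly ordered PSUs, and that Z is the positive square root of G / [(G - 1) n_h (n_h - 1)] as in Kott (2001), the value for which the jackknife estimate for a weighted total coincides with the usual stratified formula.

```python
import math
import random
from collections import defaultdict


def allocate_groups(psu_strata, G, seed=0):
    """Randomly order the PSUs within each stratum, then deal them
    systematically into G groups (labelled 0..G-1).

    psu_strata: dict mapping PSU id -> stratum id (hypothetical input layout).
    """
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for psu_id, h in psu_strata.items():
        by_stratum[h].append(psu_id)
    group_of = {}
    position = 0  # running counter, so consecutive PSUs land in different groups
    for h in sorted(by_stratum):
        psus = by_stratum[h]
        rng.shuffle(psus)
        for psu_id in psus:
            group_of[psu_id] = position % G
            position += 1
    return group_of


def replicate_base_weights(d, psu, stratum, group_of, g, G, extended=False):
    """Replicate base weights d_i(g): formula (3.1), or formula (3.2)
    when extended=True and the stratum has fewer than G PSUs."""
    psus_in = defaultdict(set)
    for p, h in zip(psu, stratum):
        psus_in[h].add(p)
    weights = []
    for d_i, p, h in zip(d, psu, stratum):
        n_h = len(psus_in[h])
        dropped = sum(1 for q in psus_in[h] if group_of[q] == g)
        if dropped == 0:
            weights.append(d_i)  # no PSU of stratum h is in group g
        elif extended and n_h < G:
            z = math.sqrt(G / ((G - 1) * n_h * (n_h - 1)))
            in_group = group_of[p] == g
            weights.append(d_i * (1 - (n_h - 1) * z) if in_group else d_i * (1 + z))
        elif group_of[p] == g:
            weights.append(0.0)  # the unit's PSU is deleted
        else:
            weights.append(d_i * n_h / (n_h - dropped))  # n_h / n_h(g)
    return weights


# Toy data: stratum A has three PSUs, stratum B has two.
psu = ["a1", "a1", "a2", "a3", "b1", "b2", "b2"]
stratum = ["A", "A", "A", "A", "B", "B", "B"]
d = [10.0, 10.0, 10.0, 10.0, 20.0, 20.0, 20.0]
groups = allocate_groups(dict(zip(psu, stratum)), G=5)
print(replicate_base_weights(d, psu, stratum, groups, groups["a1"], 5))
print(replicate_base_weights(d, psu, stratum, groups, groups["a1"], 5, extended=True))
```

Multiplying each set of replicate base weights by the corresponding correction factor gives the replicate weights $w_{i(g)}$; with the running-counter allocation, a stratum with fewer than G PSUs never has two of its PSUs in the same group, which is the situation formula (3.2) addresses.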
Let us take into account the generalized regression (GREG) estimator, where the correction factor is

$$
\gamma_i = 1 + \left( \theta_x - \sum_{i=1}^{n} d_i x_i \right) \left( \sum_{i=1}^{n} d_i c_i x_i x_i' \right)^{-1} \sum_{i=1}^{n} c_i x_i y_i ,
$$

where $x_i$ is a vector of values of auxiliary variables whose population totals $\theta_x = \sum_{i=1}^{N} x_i$ are known and $c_i$ is a constant. Using either the DAGJK or the EDAGJK replicate base weights, given by (3.1) and (3.2) respectively, a straightforward and standard approach computes the g-th correction factors as

$$
\gamma_{i(g)} = 1 + \left( \theta_x - \sum_{i=1}^{n} d_{i(g)} x_i \right) \left( \sum_{i=1}^{n} d_{i(g)} c_i x_i x_i' \right)^{-1} \sum_{i=1}^{n} c_i x_i y_i , \qquad (3.3)
$$

where the weights $d_{i(g)}$ are either the $d^{S}_{i(g)}$ or the $d^{E}_{i(g)}$ replicate g-weights. Kott (1998b, 2006) suggests a non-standard formulation for the g-th correction factors, expressed by

$$
\gamma_{i(g)} = \gamma_i + \left( \theta_x - \sum_{i=1}^{n} d_{i(g)} \gamma_i x_i \right) \left( \sum_{i=1}^{n} d_{i(g)} \gamma_i c_i x_i x_i' \right)^{-1} \sum_{i=1}^{n} \gamma_i c_i x_i y_i . \qquad (3.4)
$$

Formula (3.4) allows the replicate weights to be fairly close to the associated GREG weights. This appears to reduce the upward bias that unexpected differences between the two can cause (Kott, 1998b).

The variation among the replicate estimates makes it possible to calculate the variance estimate as follows:

$$
v_J = \frac{G - 1}{G} \sum_{g=1}^{G} \left( \hat{\theta}_{(g)} - \hat{\theta} \right)^2 . \qquad (3.5)
$$

It is worth noting that (3.5) is a suitable variance estimate for an estimator that is a linear or non-linear smooth function of the sample data. Therefore, the same procedure is used to compute the variance of nearly all statistics, e.g. means, ratios and more complex functions (excluding medians and quantiles).

4. The Balanced Repeated Replications Technique (BRR)

We briefly describe the basic technique in the ideal case of a one-stage stratified design; the extension to the multistage design is shown afterwards. Let a finite population be divided into H strata and let a sample of two units be drawn in each stratum.
Selecting one sample unit per stratum at random, we get a sub-sample, called a replication, whose size is half the original sample size. Let $a_h$ be a coefficient equal to +1 if the first unit of stratum h belongs to the replication and to -1 if the second one does; then the set of all replications can be described by a $2^H \times H$ matrix, each cell of which has the value +1 or -1. The following properties hold for this matrix (Zannella, 1989):

$$
\sum_{r=1}^{2^H} a_{rh} = 0 \qquad (h = 1, \ldots, H) \qquad (4.1)
$$

$$
\sum_{r=1}^{2^H} a_{rh} a_{rk} = 0 \qquad (h, k = 1, \ldots, H;\ k \neq h) \qquad (4.2)
$$

Property (4.1) implies that each sample unit is present in the same number of replications; property (4.2) implies that the matrix columns are mutually orthogonal. Let $\hat{\theta}$ be a linear estimate of a population parameter computed on the whole sample and let $\hat{\theta}_r$ be the estimate computed on replication r; it can be shown that (Zannella, 1989):

$$
\hat{\theta} = \frac{1}{2^H} \sum_{r=1}^{2^H} \hat{\theta}_r \qquad (4.3)
$$

$$
\mathrm{Var}(\hat{\theta}) = \frac{1}{2^H} \sum_{r=1}^{2^H} \left( \hat{\theta} - \hat{\theta}_r \right)^2 \qquad (4.4)
$$

In the case of a non-linear estimator, (4.3) does not hold and (4.4) is a variance estimate. If the method is applied to the whole set of replications, the computational cost is high even for a small H (if H = 30, the replications number more than one billion). To decrease the number of replications, it is necessary to look for an $R \ll 2^H$; selecting R replications at random, however, yields a variance estimate greater than the one computed on the whole set of replications. The Balanced Repeated Replications method was proposed by McCarthy (1969a, 1969b) as a solution to this problem. According to the BRR method, a subset of R replications is balanced if:

$$
\sum_{r=1}^{R} a_{rh} a_{rk} = 0 \qquad (h, k = 1, \ldots, H;\ k \neq h) \qquad (4.5)
$$

Moreover, if:

$$
\sum_{r=1}^{R} a_{rh} = 0 \qquad (h = 1, \ldots, H) \qquad (4.6)
$$

the replications are fully balanced. Variance estimation on R balanced replications gives the same estimate as the $2^H$ replications. To get a subset of balanced replications, Hadamard matrices must be used.
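The statement that a balanced subset reproduces the variance of the full set of half-samples can be checked numerically for a linear estimator. The sketch below is only an illustration under stated assumptions: the function names and the toy data are hypothetical, the sign matrix is taken from a Hadamard matrix built by the standard doubling construction (first column dropped, so that condition (4.6) also holds), and each half-sample estimate of a total is assumed to double the weight of the retained unit.

```python
import itertools

import numpy as np


def sylvester_hadamard(order):
    """Hadamard matrix of order 2^m built by repeated doubling [[M, M], [M, -M]]."""
    if order < 1 or order & (order - 1):
        raise ValueError("order must be a power of two")
    M = np.array([[1]])
    while M.shape[0] < order:
        M = np.block([[M, M], [M, -M]])
    return M


def half_sample_estimate(y, d, signs):
    """Half-sample estimate of a total: in stratum h keep unit 1 if signs[h] == +1,
    unit 2 if signs[h] == -1, and double its weight (assumption of this sketch)."""
    pick = np.where(signs == 1, 0, 1)
    rows = np.arange(len(signs))
    return float(np.sum(2 * d[rows, pick] * y[rows, pick]))


def brr_variance(y, d, A):
    """Mean squared deviation of the replicate estimates (rows of the sign
    matrix A) from the full-sample estimate."""
    theta = float(np.sum(d * y))
    reps = np.array([half_sample_estimate(y, d, a) for a in A])
    return float(np.mean((theta - reps) ** 2))


# Toy data: H = 3 strata, two sample units per stratum (rows = strata).
y = np.array([[3.0, 5.0], [4.0, 9.0], [2.0, 6.0]])
d = np.array([[10.0, 10.0], [20.0, 20.0], [5.0, 5.0]])

A_full = np.array(list(itertools.product([1, -1], repeat=3)))  # all 2^H half-samples
A_bal = sylvester_hadamard(4)[:, 1:]                           # R = 4 balanced half-samples

print(brr_variance(y, d, A_full), brr_variance(y, d, A_bal))
```

With H = 3 the balanced set needs only 4 replications instead of 8; the saving grows quickly with H, since R only has to be the smallest available Hadamard order greater than H.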
Hadamard matrices are special square matrices that describe k replications. For each subset of k' < k columns (excluding the first column) conditions (4.5) and (4.6) hold; for all the columns only condition (4.5) holds (Zannella, 1989). The Hadamard matrix of order 2 is:

$$
\begin{pmatrix} +1 & +1 \\ +1 & -1 \end{pmatrix}
$$

Let M be a Hadamard matrix of order k; the iterative procedure to generate a matrix of order 2k is (Wolter, 1985):

$$
\begin{pmatrix} M & M \\ M & -M \end{pmatrix}
$$

It is possible to compute two variance estimation formulas:

$$
\mathrm{Var}_1(\hat{\theta}) = \frac{1}{R} \sum_{r=1}^{R} \left( \hat{\theta} - \hat{\theta}_r \right)^2 \qquad (4.7)
$$

and

$$
\mathrm{Var}_2(\hat{\theta}) = \frac{1}{R} \sum_{r=1}^{R} \left( \hat{\theta} - \hat{\theta}^{c}_r \right)^2 \qquad (4.8)
$$

where $\hat{\theta}^{c}_r$ is the estimate of the parameter computed on the sample complementary to replication r (that is, the replication obtained by multiplying the matrix coefficients by -1). The average of (4.7) and (4.8) gives a more precise estimate of the sampling variance.

EU-SILC is based on a two-stage sampling design with stratification of the primary sampling units (PSUs), the municipalities; for this reason, the basic BRR technique has to be modified. The modification can be summarized in four steps (Wolter, 1985; Zannella, 1989):

1) sampling units are considered at the first-stage level;
2) if a stratum contains only one sampling unit, it is collapsed with a neighbouring stratum; this operation implies an overestimation of the variance;
3) if a stratum contains more than two sampling units, they are re-grouped at random into two pseudo-sampling units;
4) in self-representative strata, the sample households are considered as PSUs and are divided into two pseudo-PSUs as described in step 3.

The basic BRR can be applied at the end of the above steps.

5. The Results

Sections 5.1-5.4 show some results of the empirical study, performed using the income measures described in section 2 and aimed at comparing the variance estimates calculated by the different approaches.

5.1 A First Comparison

Tables 1-2 show the results of the comparison of the DAGJK and EDAGJK methods with BRR and the linearization approach.
With reference to the income measures described in section 2, their relative sampling errors were estimated using these four approaches. In particular, table 1 reports the results obtained using final weights and thirty random groups in the application of the DAGJK and EDAGJK methods, with the replicate g-weights computed using formula 3.4. The sampling errors obtained by BRR are closer to those obtained using the linearization approach, although the sampling errors obtained by the DAGJK and EDAGJK methods do not differ seriously from the linearization approach either. For the subgroup of 4 and more persons, the sampling errors obtained by BRR, DAGJK and EDAGJK are markedly higher than those obtained by linearization.

Table 1. Relative sampling errors (%) for the mean equivalised disposable income by linearization approach, BRR, DAGJK and EDAGJK methods (with 30 random groups using formula 3.4)

                      Equivalised disposable   Relative sampling errors %
                      income (mean)            Linearization   BRR      DAGJK (G=30)   EDAGJK (G=30)
Total                 16504                    0.62%           0.68%    0.76%          0.75%
By household size
  1 person            14788                    1.31%           1.34%    1.11%          1.28%
  2 persons           17611                    1.08%           1.16%    1.46%          0.99%
  3 persons           18094                    1.18%           1.29%    1.32%          1.43%
  4 and more persons  15924                    1.30%           1.81%    2.17%          1.71%
Source: EU-SILC 2005

Table 2 reports the results obtained using basic weights and thirty random groups for the DAGJK and EDAGJK methods. The relative sampling errors obtained are, in general, a bit higher than those shown in table 1 for EDAGJK, while they are nearly equal for DAGJK. These findings are not surprising, because calibration does not seriously improve the reliability of a ratio estimate.

Table 2. Relative sampling errors (%) for the mean equivalised disposable income by linearization approach, DAGJK and EDAGJK methods (with 30 random groups using basic weights)

                      Relative sampling errors %
                      Linearization   DAGJK (G=30)   EDAGJK (G=30)
Total                 0.60%           0.79%          0.85%
By household size
  1 person            1.08%           0.92%          1.45%
  2 persons           0.93%           1.38%          1.04%
  3 persons           1.16%           1.50%          1.46%
  4 and more persons  1.18%           1.99%          1.75%
Source: EU-SILC 2005

5.2 The Number of the Random Groups

The number of random groups used in the DAGJK and EDAGJK methods influences the results. For this reason, the use of thirty random groups was compared with that of one hundred random groups in DAGJK and EDAGJK. Table 3 reports the sampling errors obtained by DAGJK and EDAGJK, with thirty and one hundred random groups, and by BRR. The results show that more random groups do not produce very different relative sampling errors; however, DAGJK with one hundred random groups produces results closer to BRR, at least for the total. Nevertheless, this improvement in the sampling errors is quite limited and requires more computational work.

Table 3. Relative sampling errors (%) for the mean equivalised disposable income by BRR, DAGJK and EDAGJK methods (with 30 and 100 random groups using formula 3.4)

                      Relative sampling errors %
                      DAGJK (G=30)   DAGJK (G=100)   EDAGJK (G=30)   EDAGJK (G=100)   BRR
Total                 0.76%          0.74%           0.75%           0.78%            0.68%
By household size
  1 person            1.11%          1.43%           1.28%           1.50%            1.34%
  2 persons           1.46%          1.29%           0.99%           1.28%            1.16%
  3 persons           1.32%          1.45%           1.43%           1.22%            1.29%
  4 and more persons  2.17%          2.01%           1.71%           2.06%            1.81%
Source: EU-SILC 2005

5.3 Using different replicate calibration correction factors in DAGJK

The sampling errors obtained by DAGJK with the replicate weights calculated in the standard way (formula 3.3) were compared with those obtained by DAGJK with the replicate weights calculated using formula 3.4.
In this second case, the replicate g-weights are adjusted in order to be fairly close to the associated calibrated weights (Kott, 1998a, page 766). Table 4 shows the main results using DAGJK, with 30 and 100 random groups and formulas 3.3 and 3.4 respectively, and BRR. For 100 random groups, formulas 3.3 and 3.4 provide sampling errors closer to those obtained by BRR; for 30 random groups, formula 3.4 gives values closer to BRR than formula 3.3.

Table 4. Relative sampling errors (%) for the mean equivalised disposable income by DAGJK method (with 30 and 100 random groups using replicate calibration correction factors, formula 3.3 and formula 3.4) and by BRR

                      Relative sampling errors %
                      DAGJK (G=30)   DAGJK (G=30)   DAGJK (G=100)   DAGJK (G=100)   BRR
                      (3.3)          (3.4)          (3.3)           (3.4)
Total                 0.91%          0.76%          0.79%           0.74%           0.68%
By household size
  1 person            1.69%          1.11%          1.48%           1.43%           1.34%
  2 persons           1.46%          1.46%          1.25%           1.29%           1.16%
  3 persons           1.82%          1.32%          1.41%           1.45%           1.29%
  4 and more persons  2.32%          2.17%          2.16%           2.01%           1.81%
Source: EU-SILC 2005

5.4 A Special Case of EDAGJK Method

A special case of EDAGJK was tested. With reference to the non-self-representative (NSR) sampling units, the application of the EDAGJK method using 741 random groups was compared with the one using 30 random groups; in both cases the replicate final weights were computed by formula 3.4. This is a special case: the application of the EDAGJK method for NSR units using 741 random groups corresponds to the application of the Jackknife in which the replicate basic weights are computed using formula 3.2. Table 5 shows the results; the application with 30 random groups (which is easier from a computational point of view) gives sampling errors that are close to those obtained with 741 random groups.

Table 5. Relative sampling errors (%) for the mean equivalised disposable income by EDAGJK - for NSR units - using (3.4), final weights, 30 and 741 random groups

                      Relative sampling errors %
                      EDAGJK (G=30) (NSR)   EDAGJK (G=741) (NSR)
Total                 0.80%                 0.77%
By household size
  1 person            1.55%                 1.57%
  2 persons           1.24%                 1.35%
  3 persons           1.61%                 1.46%
  4 and more persons  1.70%                 1.52%
Source: EU-SILC 2005

References

De Vitiis, C., Di Consiglio, L., Falorsi, S., Pauselli, C., Rinaldelli, C. (2003), “La valutazione dell’errore di campionamento delle stime di povertà relativa”, final report presented at ‘Povertà Regionale ed Esclusione Sociale’, Roma 17-12-03.
Falorsi, S., and Rinaldelli, C. (1998), “Un software generalizzato per il calcolo delle stime e degli errori di campionamento”, Statistica Applicata, 10, 2, pp. 217-234.
ISTAT (2005), GENESEES 3.0 Manuale Utente e Aspetti Metodologici.
Kott, P.S. (1998a), “Using the delete-a-group Jackknife variance estimator in practice”, National Agricultural Statistics Service, Washington DC, pp. 763-768.
Kott, P.S. (1998b), “Using the Delete-a-Group Jackknife Variance Estimator in NASS Surveys”, RD Research Report No. RD-98-01, Washington, DC: National Agricultural Statistics Service.
Kott, P.S. (2001), “The Delete-a-Group Jackknife”, Journal of Official Statistics, 17, pp. 521-526.
Kott, P.S. (2006), “The Delete-a-Group Variance Estimation for the General Regression Estimator Under Poisson Sampling”, Journal of Official Statistics, 22, pp. 759-767.
McCarthy, P.J. (1969a), “Pseudoreplication: further evaluation and application of the balanced half-sample technique”, Vital and Health Statistics, Series 2 n.31, National Center for Health Statistics, Public Health Service, Washington, D.C.
McCarthy, P.J. (1969b), “Pseudoreplication: Half-Samples”, Review of the International Statistical Institute, 37, pp. 239-264.
Moretti, D., Pauselli, C., Rinaldelli, C. (2005), “La stima della varianza campionaria di indicatori complessi di povertà e disuguaglianza”, Statistica Applicata, Vol. 17, n. 4, pp. 529-550.
Pauselli, C., and Rinaldelli, C. (2003), “La valutazione dell’errore di campionamento delle stime di povertà relativa secondo la tecnica Replicazioni Bilanciate Ripetute”, Rivista di Statistica Ufficiale, 2/2003, pp. 7-22.
Pauselli, C., and Rinaldelli, C. (2004), “Stime di povertà relativa: la valutazione dell’errore campionario secondo le Replicazioni Bilanciate Ripetute”, Statistica Applicata, Vol. 16, n. 1, pp. 89-101.
Regulation (EC) No 1177/2003 of the European Parliament and of the Council of 16 June 2003 concerning Community statistics on income and living conditions (EU-SILC), Official Journal of the European Union, L 165/1, 3.7.2003.
Rinaldelli, C. (2005), “Statistiche complesse e software”, Statistica & Società, Anno III, n. 2, 01.2005, pp. 27-29.
Rinaldelli, C. (2006), “Experiences of variance estimation for relative poverty measures and inequality indicators”, COMPSTAT Proceedings in Computational Statistics, pp. 1465-1472, Italy 2006.
Rust, K. (1985), “Variance Estimation for Complex Estimators in Sample Surveys”, Journal of Official Statistics, 1, pp. 381-397.
Särndal, C.E., Swensson, B., Wretman, J. (1992), Model assisted survey sampling, New York: Springer-Verlag.
Wolter, K.M. (1985), Introduction to variance estimation, New York: Springer-Verlag.
Zannella, F. (1989), Tecniche di stima della varianza campionaria, Manuale di tecniche di indagine, Collana ISTAT Note e Relazioni, anno 1989, n.1, vol.5.