Experiences of Resampling Approach in the Household Sample Surveys

Transcription

Experiences of Resampling Approach in the Household Sample Surveys
Experiences of Resampling Approach in the Household Sample
Surveys
Stefano Falorsi, Diego Moretti, Paolo Righi, Claudia Rinaldelli, ISTAT via Adolfo
Ravà 150, 00142 Roma, Italy, e-mail: stfalors, dimorett, parighi, rinaldel{@istat.it}
1. Introduction 1
ISTAT calculates sampling variance of common estimates (frequencies, means,
totals) using the standard methodology 2 (Särndal et al., 1992; Wolter, 1985) and two
software procedures 3 were developed in SAS to implement this methodology in
complex sampling designs (Falorsi and Rinaldelli, 1998; ISTAT, 2005); the
development of independent and generalized software is a common practice in many
National Statistical Institutes.
ISTAT has paid special attention to the dissemination of complex statistics in the last
years; we mean statistics expressed by non linear functions. The standard
methodology and therefore the current software procedures of variance estimation
can not be directly applied to these statistics (De Vitiis et al., 2003; Rinaldelli, 2006).
Following the main experience gained at ISTAT in using the Balanced Repeated
Replications (BRR) technique for estimating the sampling errors of the Laeken
indicators 4 (Moretti et al., 2005; Pauselli and Rinaldelli, 2003, 2004; Rinaldelli, 2005),
this work reports some empirical comparisons of the resampling techniques - the
Delete-A-Group Jackknife (DAGJK), the Extended Delete-A-Group Jackknife
(EDAGJK), the Balanced Repeated Replications - in the context of the European
Statistics on Income and Living Conditions survey (EU-SILC).
2. The Context: EU-SILC Survey and the Income Measures
The Statistics on Income and Living Conditions survey (EU-SILC survey) is aimed at
producing estimates on income, living conditions, poverty; among these, there are
the European relative poverty measures and inequality indicators (the Laeken
indicators). EU-SILC has been carried out by ISTAT under European Regulation
since 2004 (Regulation, 2003) and it is based on a two-stage sampling design
(municipalities, households); yearly, about 60,000 people are involved in the survey
and the selected people are traced and surveyed for four consecutive years.
As mentioned in section 1, a main experience in evaluating the sampling variance of
the complex Laeken indicators was performed using the Balanced Repeated
Replications technique.
Here, in order to empirically compare the Delete-A-Group Jackknife and the
Extended Delete-A-Group Jackknife methods with BRR and linearization approach,
the following income measures were considered: the mean equivalised disposable
1 Claudia Rinaldelli wrote sections 1, 2, 5, 5.2 and 5.4, Diego Moretti wrote sections 4, 5.1 and 5.3,
Paolo Righi and Stefano Falorsi wrote section 3.
2 Standard methodology means a well known and widely implemented methodology; literature on
sampling theory for finite population provides formulas to calculate sampling variance for the most
used sampling designs and estimators.
3 SGCE, Generalized System for Sampling Errors Estimation; software GENESEES, GENEralised
Software for Sampling Errors Estimation in Surveys, is on www.istat.it.
4 The European relative poverty measures and inequality indicators calculated using the data of the
European Statistics on Income and Living Conditions survey (EU-SILC).
income of households by size: one, two, three, four and more people. The
equivalised disposable income is obtained dividing the total disposable income by the
modified OECD scale to take into account different sizes and compositions of the
households.
In this empirical study, we used the above income measures - that are estimates of
ratio parameters - instead of quantiles functions because, generally, the Jackknife
variance estimators of quantiles are not consistent.
3. The Delete-A-Group Jackknife and Extended Delete-A-Group Jackknife
methods
Usually, in the large scale surveys the goal is to estimate a parameter
θ = f ( y1 ,..., y N ) by using a sample statistic θˆ = g ( y1 ,..., y n ) where yi is the observed
value of the variable of interest on i-th unit, N is the target population size and n is the
sample size. Complex sampling designs and non linear estimators may be used. In
this context one standard approach for estimating the sampling error of θˆ is to apply
the Taylor series expansion method linearizing the estimator. Then, the standard
sample survey variance estimation methods is adopted. An alternative approach is to
use replication techniques such as Jackknife and Balanced Repeated Replication
(BRR) methods. There are some advantages of using Standard Jackknife in Official
Statistics such as: no model assumptions; unit and item nonresponses are easily
dealt with; variance of nonlinear statistics and estimation for domains can be easily
calculated by external users. Nevertheless, Jackknife becomes computer intensive
for large scale surveys because of a large amount of repeated estimation procedures
for each sub-sample replicate. In this section we describe the Delete-A-Group
Jackknife (DAGJK) and Extended DAGJK (EDAGJK) methods. These methods
belong to the strategies for using the Jackknife with reduced replications, while
maintaining adequate precision of variance estimation.
DAGJK and EDAGJK are based on the following Jackknife procedure:
Primary Sample Units (PSUs) in the same stratum are randomly ordered; from this
ordering, the PSUs are systematically allocated into G groups; considering the g-th
group the replicate g-weights for the elementary k-unit are computed.
Let us consider a weighted estimator of the form θˆ = ∑i:1,...,n wi yi of the total
θ = ∑i:1,..., N y i with wi = d i γ i the sample weight of unit i, being d i the base weight and
γ i a correction factor for non-response or calibration. The g-th (g=1, …,G) replicate
estimate is θˆ( g ) = ∑i:1,...,n wi ( g ) yi , with wi ( g ) = d i ( g ) γ i ( g ) the g-th replicate weight of unit i.
DAGJK and EDAGJK differ each other for the expression of d i ( g ) .
In the DAGJK method the replicate g-weights d i ( g ) are denoted by d iS( g ) and they are
expressed as:
d iS( g )
⎧di , when i ∈ h and no PSU of stratum h is included in group g,
⎪
= ⎨0, when i belongs to a PSU included in group g,
⎪n /n
⎩ h h( g ) di , otherwise,
[
]
(3.1)
in which nh is the PSU sample size of stratum h, and nh( g ) is the PSU sample size of
stratum h excluding the PSUs in stratum h and group g.
In EDAGJK, when nh < G the replicate g-weights d i ( g ) , denoted by d iE( g ) , are given by
diE( g )
⎧di , when i ∈ h and no PSU of stratum h is included in group g,
⎪
= ⎨di [1 − ( nh − 1)Z ] , when i belongs to a PSU included in group g,
⎪ d (1 + Z ) , otherwise,
⎩ i
(3.2)
with Z = G / [(G − 1)nh (nh − 1)] . When nh ≥ G then diS( g ) = diE( g ) .
The precision of the variance estimates depends on the values of G and nh (h= 1, …,
H). Kott (1998b) shows that nh = 5 is the minimum value to obtain a relative bias at
most 20% when using the weights (3.1) for the Horvitz-Thompson estimator. This
condition is rarely satisfied in the socio-demographic surveys using highly clusterized
sampling designs and selecting, usually, a fixed number of few PSUs in each
stratum. Kott (2001) shows that using the weights (3.2) the estimate of the variance
of the Horvitz-Thompson estimator is unbiased even in the case of nh < 5 . To fix the
G value, there does not appear to be any reference in the literature about appropriate
number of groups. Anyway it is common practice a choice between 15 and 30 (Kott,
1998b; Rust, 1985), considering that when G is greater than 15 the Student’s t
distribution is approximated quite good by the normal distribution.
There are some alternative ways to compute the replicate correction factor γ i ( g )
(Kott, 1998b; 2006). Let us take into account the generalized regression (GREG)
estimator where the correction factor is
n
⎞
⎛
⎞⎛ n
γ i = 1 + ⎜θ x − ∑ d i x i ⎟⎜ ∑ d i ci x i x′i ⎟
i =1
⎠
⎝
⎠⎝ i =1
−1 n
∑c x y
i =1
i
i
i
,
where xi is a vector of values of auxiliary variables whose population totals
θx = ∑i:1,..., N xi are known and ci is a constant.
Using alternatively the DAGJK or EDAGJK replicate base weights expressed
respectively by (3.1) or (3.2), a straightforward and standard approach computes the
g-th correction factors as
γ i( g )
n
⎞
⎛
⎞⎛ n
= 1 + ⎜θ x − ∑ d i ( g ) x i ⎟⎜ ∑ d i ( g ) ci x i x′i ⎟
i =1
⎠
⎝
⎠⎝ i =1
−1 n
∑c x y
i =1
i
i
i
.
(3.3)
where the weights d i( g ) can be worked out as d iS( g ) or d iE( g ) replicate g-weights. Kott
(1998b, 2006) suggests a non-standard formulation for the g-th correction factors
expressed by
γ i( g )
n
⎞
⎞⎛ n
⎛
= γ i + ⎜⎜ θx − ∑ di( g ) γ i xi ⎟⎟⎜⎜ ∑ di( g ) γ i ci xi x′i ⎟⎟
i =1
⎠
⎠⎝ i =1
⎝
−1
n
∑ γici xi yi .
i =1
(3.4)
The (3.4) allows the replicate weights to be fairly close to the associated GREG
weights. This appears to reduce the upward bias that unexpected differences
between the two can cause (Kott, 1998b).
The variation among the replicate estimates enables the calculation of variance
estimate as the follows:
vJ =
(
G −1 G ˆ
∑ θ ( g ) − θˆ
G g =1
)
2
.
(3.5)
It is worthwhile to note that (3.5) is a suitable estimate of the variance for an
estimator being a linear or non linear smooth function of the sample data. Therefore,
the same procedure is used to compute the variance nearly all statistics, e.g. the
variance of means, ratios and more complex functions (excluding medians and
quantiles).
4. The Balanced Repeated Replications Technique (BRR)
We briefly describe the basic technique in the ideal case of one stage stratified
design; the extension to the multistage design will be showed. Let a finite population
be divided in H strata and a sample of two units drawn in each stratum. Selecting a
sample unit per stratum at random, we get a sub-sample called replication whose
size is the half of the original sample size. Let ah be a coefficient with value +1, if the
first unit of h stratum belongs to the replication, -1 if the second one belongs to the
replication; then the set of all replications can be described by a matrix 2H x H. Each
cell of this matrix has the value +1 or –1.
In this matrix these properties hold (Zannella, 1989):
2H
∑a
r =1
rh
=0
(h=1.....H)
(4.1)
2H
∑a
r =1
rh
a rk = 0
(h=1.....H)
(4.2)
The (4.1) implies that each sample unit has to be present in the same number of
replications, the (4.2) that matrix columns are each other orthogonal.
Let θˆ be a linear estimation of a population parameter computed on the whole
sample and let θˆ r the estimation computed on the replication r, it can be showed
that (Zannella, 1989):
1
θˆ = H
2
2H
∑ θˆ
r =1
(4.3)
r
H
1 2 ˆ ˆ 2
Var( θˆ ) =
∑ (θ - θ h )
2H r = 1
(4.4)
In the case of non linear estimator the (4.3) doesn’t hold and the (4.4) is a variance
estimation. If the method is applied to the whole set of replications, the computation
cost is high also for a small H (if H=30, replications amount to more than one billion).
To decrease the number of replications, it is necessary to look for a R << 2H;
selecting R replications at random, implies a variance estimation greater than the
estimation computed on the whole set of replications. The Balanced Repeated
Replications method has been given by McCarthy (McCarthy, 1969a-1969b) as a
solution to this problem. According to BRR method, a subset of R replication is
balanced if:
R
∑ a a
r = 1
rh
rk
=0
(h,k=1.....H; k ≠ h)
(4.5)
Moreover if:
R
∑ a
r = 1
rh
=0
(h=1.....H)
(4.6)
the replications are fully balanced.
Variance estimation on R balanced replications give the same estimation of the 2H
replications.
To get a subset of balanced replications, Hadamard’s matrices must be used.
Hadamard’s matrices are special squared matrices to describe k replications. For
each subset of k’<k columns (excluding the first column) conditions (4.5) and (4.6)
hold; for all the columns only condition (4.5) holds (Zannella, 1989).
The Hadamard’s matrix of order 2 is:
⎛ + 1 + 1⎞
⎜⎜
⎟⎟
⎝ + 1 − 1⎠
Let M a Hadamard’s matrix of order k; the iterative procedure to generate a 2k order
matrix is (Wolter, 1985):
⎛M M ⎞
⎜⎜
⎟⎟
⎝M − M⎠
It is possible to compute two variance estimation formulas:
1 R ˆ ˆ 2
Var1( θˆ ) =
∑ (θ − θ r )
Rr =1
(4.7)
and
Var2( θˆ ) =
1
R
R
∑ (θˆ − θˆ
r =1
c 2
r
)
(4.8)
where θˆ cr is the estimation of the parameter computed on the sample complement to
replication r (that is the replication got multiplying for –1 the matrix coefficients).
The average of (4.7) and (4.8) gives a more precise estimation of the sampling
variance. EU-SILC is based on a two-stage sampling design with stratification of
primary sampling units (PSU), the municipalities; for this reason, BRR basic
technique has to be modified. The modification can be summarized in four steps
(Wolter, 1985; Zannella, 1989):
1) sampling units are considered at the first stage level; 2) if a stratum contains
one sampling unit only, it is collapsed to a neighbour stratum; this operation implies
an overestimation of the variance; 3) if a stratum contains more than two sampling
units, they are re-grouped in two pseudo-sampling units at random; 4) in selfrepresentative strata, sampling households are considered as PSU and they are
divided in two pseudo-PSU as described in step 3.
The basic BRR can be applied at the end of the above steps.
5. The Results
Sections 5.1-5.4 show some results of the empirical study - performed using the
income measures described in section 2 - aimed at comparing the variance
estimates calculated by different approaches.
5.1 A First Comparison
Tables 1-2 show the results of the comparison of the DAGJK and EDAGJK methods
with BRR and linearization approach. With reference to the income measures
described in section 2, their relative sampling errors were estimated using these four
different approaches. In particular, table 1 reports the results obtained using final
weights and thirty random groups in the application of the DAGJK and EDAGJK
methods when the replicate g-weights have been worked using formula 3.4.
Sampling errors by BRR are closer to those obtained using linearization approach
even if sampling errors by the DAGJK and EDAGJK methods are not seriously
different from the linearization approach. Sampling errors of the subgroup 4 and
more persons are really higher in BRR, DAGJK and EDAGJK than those obtained by
linearization.
Table 1. Relative sampling errors (%) for the mean equivalised disposable income by
linearization approach, BRR, DAGJK and EDAGJK methods (with 30 random groups
using formula 3.4)
Relative sampling errors %
Equivalised
DAGJK
EDAGJK
disposable Linearization BRR
(G=30)
(G=30)
income
(mean)
Total
16504
0.62%
0.68%
0.76%
0.75%
By household size
1 person
14788
1.31%
1.34%
1.11%
1.28%
2 persons
17611
1.08%
1.16%
1.46%
0.99%
3 persons
18094
1.18%
1.29%
1.32%
1.43%
4 and more persons
15924
1.30%
1.81%
2.17%
1.71%
Source: EU-SILC 2005
Table 2 reports the results obtained using basic weights and thirty random groups for
DAGJK and EDAGJK methods. The obtained relative sampling errors are, in general,
a bit higher than those showed in table 1 for EDAGJK while they are nearly equal for
the DAGJK.
These findings are not surprising because the calibration does not improve seriously
the reliability of a ratio estimate.
Table 2. Relative sampling errors (%) for the mean equivalised disposable income by
linearization approach, DAGJK and EDAGJK methods (with 30 random groups using
basic weights)
Relative sampling errors %
DAGJK
EDAGJK
Linearization
(G=30)
(G=30)
Total
By household size
1 person
2 persons
3 persons
4 and more persons
Source: EU-SILC 2005
0.60%
0.79%
0.85%
1.08%
0.93%
1.16%
1.18%
0.92%
1.38%
1.50%
1.99%
1.45%
1.04%
1.46%
1.75%
5.2 The Number of the Random Groups
The number of the random groups used in DAGJK and EDAGJK methods influences
the results. For this reason, the use of thirty random groups was compared with that
of one hundred random groups in DAGJK and EDAGJK. Table 3 reports the
sampling errors obtained by DAGJK and EDAGJK - with thirty and one hundred
random groups - and by BRR.
The results underline that more random groups don’t produce very different relative
sampling errors, however DAGJK - using one hundred random groups - produces
results closer to BRR, at least for the total. Nevertheless this improvement in
sampling errors is quite bounded and it requires more work from a computational
point of view.
Table 3. Relative sampling errors (%) for the mean equivalised disposable income by
BRR, DAGJK and EDAGJK methods (with 30 and 100 random groups using formula
3.4)
Relative sampling errors %
Total
By household size
1 person
2 persons
3 persons
4 and more persons
Source: EU-SILC 2005
DAGJK
(G=30)
DAGJK
(G=100)
EDAGJK EDAGJK
(G=30) (G=100)
0.76%
0.74%
0.75%
0.78%
BRR
0.68%
1.11%
1.46%
1.32%
2.17%
1.43%
1.29%
1.45%
2.01%
1.28%
0.99%
1.43%
1.71%
1.50%
1.28%
1.22%
2.06%
1.34%
1.16%
1.29%
1.81%
5.3 Using different replicate calibration correction factors in DAGJK
The sampling errors obtained by DAGJK with the replicate weights, calculated by the
standard way (formula 3.3), were compared with those obtained by DAGJK based on
the replicate weights calculated using formula 3.4. In this second case, the replicate
g-weights are adjusted in order to be fairly close to the associated calibrated weights
(Kott, 1998a, page 766).
Table 4 shows the main results using DAGJK - with 30 and 100 random groups and
respectively formulas 3.3 and 3.4 – and BRR.
For 100 random groups, formulas 3.3 and 3.4 provide sampling errors closer to those
obtained by BRR; for 30 random groups, formula 3.4 estimates values closer to BRR
than formula 3.3.
Table 4. Relative sampling errors (%) for the mean equivalised disposable income by
DAGJK method (with 30 and 100 random groups using replicate calibration
correction factors, formula 3.3 and formula 3.4.) and by BRR
Relative sampling errors %
DAGJK
DAGJK
DAGJK
DAGJK
BRR
(G=30)
(G=30)
(G=100)
(G=100)
(3.3)
(3.4)
(3.3)
(3.4)
Total
0.91%
0.76%
0.79%
0.74%
0.68%
By household size
1 person
1.69%
1.11%
1.48%
1.43%
1.34%
2 persons
1.46%
1.46%
1.25%
1.29%
1.16%
3 persons
1.82%
1.32%
1.41%
1.45%
1.29%
4 and more persons
2.32%
2.17%
2.16%
2.01%
1.81%
Source: EU-SILC 2005
5.4 A Special Case of EDAGJK Method
A special case of EDAGJK was experimented. With reference to the non selfrepresentative sampling units (NSR), the application of the EDAGJK method - using
741 random groups - was compared with that implemented using 30 random groups
where in both cases the replicate final weights were computed by formula 3.4.
This is a special case: the application of the EDAGJK method - for NSR using 741
random groups - corresponds to the application of the Jackknife in which the
replicate basic weights are computed using formula 3.2.
Table 5 shows the results; the application with 30 random groups (that is easier from
a computational point of view) estimates sampling errors that are closer to those
obtained with 741 random groups.
Table 5. Relative sampling errors (%) for the mean equivalised disposable income by
EDAGJK - for NSR units - using (3.4), final weights, 30 and 741 random groups
Relative sampling errors %
EDAGJK (G=30) (NSR)
EDAGJK (G=741) (NSR)
Total
By household size
1 person
2 persons
3 persons
4 and more persons
Source: EU-SILC 2005
0.80%
0.77%
1.55%
1.24%
1.61%
1.70%
1.57%
1.35%
1.46%
1.52%
References
De Vitiis, C., Di Consiglio, L., Falorsi, S., Pauselli, C., Rinaldelli, C. (2003), “La
valutazione dell’errore di campionamento delle stime di povertà relativa”, final report
presented at ‘Povertà Regionale ed Esclusione Sociale’, Roma 17-12-03.
Falorsi, S., and Rinaldelli, C. (1998), “Un software generalizzato per il calcolo delle
stime e degli errori di campionamento”, Statistica Applicata, 10, 2, pp. 217-234.
ISTAT (2005), GENESEES 3.0 Manuale Utente e Aspetti Metodologici.
Kott, P.S. (1998a), “Using the delete-a-group Jackknife variance estimator in
practice”, National Agricultural Statistics Service, Washington DC, pp. 763-768.
Kott, P.S. (1998b), “Using the Delete-a-Group Jackknife Variance Estimator in NASS
Surveys”, RD Research Report No. RD-98-01. Washington, DC: National Agricultural
Statistics Service.
Kott, P.S. (2001), “The Delete-a-Group Jackknife”, Journal of Official Statistics, 17,
591-526.
Kott, P.S. (2006), “The Delete-a-Group Variance Estimation for the General
Regression Estimator Under Poisson Sampling”, Journal of Official Statistics, 22,
759-767.
McCarthy, P.J. (1969a), “Pseudoreplication: further evaluation and application of the
balanced half-sample technique”, Vital and Health Statistics, Series 2 n.31, National
Center for Health Statistics, Public Health Service, Washington, D.C.
McCarthy, P.J. (1969b), “Pseudoreplication: Half-Samples”,
International Statistical Institute, 37, pp. 239-264.
Review
of
the
Moretti, D., Pauselli, C., Rinaldelli, C. (2005), “La stima della varianza campionaria di
indicatori complessi di povertà e disuguaglianza”, Statistica Applicata, Vol. 17, N. 4,
pp. 529-550.
Pauselli, C., and Rinaldelli, C. (2003), “La valutazione dell’errore di campionamento
delle stime di povertà relativa secondo la tecnica Replicazioni Bilanciate Ripetute”,
Rivista di Statistica Ufficiale, 2/2003, pp. 7-22.
Pauselli, C., and Rinaldelli, C. (2004), “Stime di povertà relativa: la valutazione
dell’errore campionario secondo le Replicazioni Bilanciate Ripetute”, Statistica
Applicata, Vol.16, n.1, pp. 89-101.
REGULATION (EC) No 1177/2003 of the EUROPEAN PARLIAMENT and of the
COUNCIL of 16 June 2003 concerning Community statistics on income and living
conditions (EU-SILC), 3.7.2003 L 165/1 Official Journal of the European Union.
Rinaldelli, C. (2005), “Statistiche complesse e software”, Statistica & Società, Anno
III, n.2, 01.2005, pp. 27-29.
Rinaldelli, C. (2006), “Experiences of variance estimation for relative poverty
measures and inequality indicators”, COMPSTAT Proceedings in Computational
Statistics, pp. 1465-1472, Italy 2006.
Rust, K. (1985), “Variance Estimation for Complex Estimators in Sample Surveys”,
Journal of Official Statistics, 1, 381-397.
Särndal, C.E., Swensson, B., Wretman, J. (1992), Model assisted survey sampling.
New York Springer-Verlag.
Wolter, K.M. (1985), Introduction to variance estimation, New York Springer-Verlag.
Zannella, F. (1989), Tecniche di stima della varianza campionaria, Manuale di
tecniche di indagine, Collana ISTAT Note e Relazioni, anno 1989, n.1, vol.5.