Document 6530025
MACROS FOR SYSTEMATIC SAMPLE SELECTION AND VARIANCE ESTIMATION FROM ORDERED FRAMES

Josefina Lago, Westat, Inc.

1. Introduction

Systematic sampling is one of the most commonly used methods of sample selection, particularly at the second and later stages of selection of a multi-stage design. Systematic sampling's greatest advantage is its simplicity. Another advantage is that, under certain conditions, systematic sampling variances are often smaller than those from alternative designs. (As survey practitioners well know, it is safe to use systematic sampling only when one is sufficiently acquainted with the data to determine when systematic selection is not appropriate, as for example with data that are periodic in relation to the order of the listing, where the selection interval is equal to or a multiple of the period.)

The major shortcoming of systematic sampling with a single random start is that it does not yield an unbiased estimator of variance. A systematic sample may be viewed as a sample of one cluster, whereas a sample of size two or more is generally needed to construct an unbiased estimator of variance. However, several biased estimators are available that provide satisfactory variance estimates for many situations where systematic sampling is used in practice.

This paper focuses on systematic sample selection and variance estimation when a frame has been ordered on the basis of an auxiliary variable presumed related to the population characteristics of interest. A systematic sample selected from such an ordered list provides a kind of implicit stratification with equal or unequal sampling fractions, depending on whether equal probability or probability proportional to size (PPS) systematic selection is used. In this situation, a variance estimator commonly referred to as the "successive differences" estimator may be constructed by regarding each sample unit as selected at random from a stratum.

This paper discusses both equal probability and PPS systematic sample selection, as well as the basic formulation of variance estimation by "successive differences". Also described are the three MACROS developed to implement the sample selection and the variance estimation of an estimated total. The use of the MACROS (SYSSAMP1, SYSSAMP2, and SYSVAR) is illustrated by means of an example.

2. Equal Probability Systematic Selection

To draw an equal probability systematic sample we first compute the "selection interval," $k$:

\[ k = N/n \]

where
$n$ = desired sample size; and
$N$ = total number of units in the frame.

Of course, the value of $k$ is not necessarily an integer. Typically, the value of $k$ is rounded off to one or two decimal places when an exact sample size is required. Next a random number, referred to as the "random start" ($r$), is chosen between 0 and $k$ (excluding 0 but including $k$). The sample units are those having positions on the list corresponding to the integer portion of $r, r+k, r+2k, \ldots, r+(n-1)k$.

An unbiased estimator of a population total, $\hat{Y}_{sy}$, estimated from an equal probability systematic sample is given by:

\[ \hat{Y}_{sy} = k \sum_{i=1}^{n} y_i = \sum_{i=1}^{n} y'_i \]

where
$k$ = the inverse of the inclusion probability of each unit; and
$y'_i = k y_i$ = weighted value of the ith sample observation of variable Y.

MACRO SYSSAMP1 implements an equal probability systematic selection as described above, and MACRO SYSVAR may be used to estimate the variance of a total estimated from such a sample.

3. PPS Systematic Selection

The PPS systematic sampling scheme is widely used when one expects a proportionality between a size variable (or "measure of size"), $X_i$, and the characteristics of interest. With this sampling scheme the inclusion probability of the ith unit is proportional to $X_i$.

Assuming the units in the frame have been arranged in the desired sequence, the PPS selection is carried out as follows. First, a cumulative measure of size, $M_i$, is calculated for each unit in the frame. This cumulative size is simply the measure of size of the ith unit, $X_i$, added to the measure of size of all preceding units in the list, $M_{i-1}$. That is:

\[ M_i = \sum_{j=1}^{i} X_j \]

and the total of all measures of size, $M_N$, is given by:

\[ M_N = \sum_{i=1}^{N} X_i \]

Next, the selection interval, $k$, is calculated by:

\[ k = \Big( \sum_{i=1}^{N} X_i \Big) / n = M_N / n \]

Obviously, $k$ is not necessarily an integer and, as mentioned earlier, it is usually rounded to one or two decimal places.

It is possible that some units in the list have a measure of size greater than the selection interval $k$. These units are selected with certainty and are deleted from the list prior to sample selection. No bias in survey estimates is introduced by the prior selection of the $n_c$ certainty units, as long as the selection probabilities of the noncertainty units are adjusted to reflect the excluded certainty units. Hence, $k$ is redefined as:

\[ k = M'_N / n' \]

where
$n'$ = noncertainty sample size; and
$M'_N$ = total cumulative measure of size after the $n_c$ units have been excluded.

The random start, $r$, in PPS selection is a random number between 0 and $k$ (as redefined above). The $n'$ selection numbers are then: $r, r+k, r+2k, \ldots, r+(n'-1)k$. The unit to be drawn into the sample corresponding to a given selection number is the first unit on the list for which the cumulative size, $M_i$, is greater than or equal to the selection number.

An unbiased estimator of a population total estimated from a PPS systematic sample is given by the Horvitz-Thompson estimator, $\hat{Y}_{HT}$:

\[ \hat{Y}_{HT} = \sum_{i=1}^{n} y_i / \pi_i = \sum_{i=1}^{n} y'_i \]

where
$n$ = total sample size;
$\pi_i$ = inclusion probability of the ith unit in the sample (i.e., 1 for certainty selections and $X_i/k$ for noncertainties); and
$y'_i = y_i / \pi_i$ = weighted value of Y for the ith sample unit.

MACRO SYSSAMP2 implements a PPS systematic sample selection. MACRO SYSVAR may be used to estimate the variance of an estimated total once sample observations have been properly weighted.

4. Variance Estimation by Successive Differences

When sampling systematically from an ordered list, as described above, one can view the units associated with the first $k$ possible selection numbers as constituting a first stratum, those associated with $k+1$ to $2k$ as constituting a second stratum, and so on, and finally those associated with $(n-1)k+1$ to $nk$ as a last stratum. An estimate of variance may be obtained by regarding each unit in the sample as selected at random from a stratum, and grouping all possible pairs of sample observations from contiguous strata.

The "successive differences" estimator of the variance of an estimated total based on the overlapping differences is given by:

\[ V_{sy} = \frac{n}{2(n-1)} \sum_{i=1}^{n-1} (y'_i - y'_{i+1})^2 \]

where
$y'_i$ = the weighted value of the ith observation in the sample; and
$n$ = the effective sample size (i.e., number of noncertainty units in the sample).

5. Implementation

To use MACRO SYSSAMP1 the following user-defined MACROS are required:

MACRO _STRVAR: variable to be used in implicit stratification
MACRO _SAMPSZ: sample size desired
MACRO _DSFRAME: input data set containing the sampling frame
MACRO _DSSAMP: output data set containing the selected sample.

Since the universe size (N) is determined by the number of observations in the frame, any observation with a missing value for the auxiliary variable (_STRVAR) should be deleted prior to sample selection.

To use MACRO SYSSAMP2, a user-defined macro must be included to identify the measure of size (MOS) variable, in addition to the MACROS required by SYSSAMP1. That is:

MACRO _MEASRSZ: name of the MOS variable to be used in the PPS selection.

Units with a measure of size greater than .75 times the selection interval are identified as certainties. Then, the selection interval $k$ is redefined and the noncertainty sample of size $n'$ is selected.

MACRO SYSVAR requires the following user-defined macros:

MACRO _DSNAME: input data set containing the sample data
MACRO _VARLIST: list of variables for which variances are to be computed
MACRO _SAMPSEQ: sort variable to be used to order the file with the sample data in the same order as in sample selection.

It must be remembered that before using MACRO SYSVAR, the values of the variables for which variances are requested must be properly weighted.

6. Illustration

To illustrate the use of MACROS SYSSAMP1, SYSSAMP2 and SYSVAR, consider the following example. A state's department of education wants to estimate characteristics of private schools in the state, such as total employment (EMPL), square feet (SQFEET), electricity consumption (ELECBTU), and total energy consumption (TOTBTU). A sample of 80 schools is to be selected by two alternative selection procedures: equal probability and PPS. The measure of size to be used is the previous year's reported student enrollment (ENRLMNT). For illustration purposes, let us assume the list of schools is also to be sorted by reported enrollment (ENRLMNT), to achieve a size stratification effect.

Exhibits 1, 2, and 3 illustrate the set-up, listing, and output of the three MACROS, as required for the example described above.
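
The two selection rules of Sections 2 and 3 can be sketched in a few lines. The sketch below is written in Python rather than in the SAS macro language of the exhibits, and it is not the code of SYSSAMP1 or SYSSAMP2. The function names, the 0-based indices returned by the PPS routine, the `cutoff` parameter (1.0 reproduces the rule of Section 3, 0.75 the rule stated for SYSSAMP2), and the way the equal probability start is drawn when none is supplied are all assumptions made for illustration. Certainty units are identified in a single pass, as the text describes.

```python
import bisect
import itertools
import random


def equal_probability_positions(N, n, r=None, rng=random):
    """Return 1-based list positions of an equal probability systematic sample."""
    k = round(N / n, 2)  # selection interval, rounded to two decimal places
    if r is None:
        # Assumption (not from the paper): draw the start from [1, k + 1)
        # so that its integer portion is always a valid 1-based position.
        r = 1 + k * rng.random()
    positions = [int(r + j * k) for j in range(n)]
    # If k was rounded upward, the last selection number can exceed N; drop it.
    return [p for p in positions if p <= N]


def pps_systematic(sizes, n, r=None, cutoff=1.0, rng=random):
    """Return (certainty indices, sampled indices, interval k), indices 0-based.

    `sizes` holds the measure of size X_i of each unit, in frame order.
    """
    k = sum(sizes) / n  # initial interval k = M_N / n
    certain = [i for i, x in enumerate(sizes) if x > cutoff * k]
    rest = [i for i, x in enumerate(sizes) if x <= cutoff * k]
    n_prime = n - len(certain)  # noncertainty sample size n'
    if n_prime < 1 or not rest:
        return certain, [], k
    k = sum(sizes[i] for i in rest) / n_prime  # redefined interval M'_N / n'
    if r is None:
        r = k * (1 - rng.random())  # random start in (0, k]
    cum = list(itertools.accumulate(sizes[i] for i in rest))  # cumulative sizes M_i
    sampled = []
    for j in range(n_prime):
        selection_number = r + j * k
        # First unit whose cumulative size is >= the selection number.
        pos = bisect.bisect_left(cum, selection_number)
        sampled.append(rest[min(pos, len(rest) - 1)])  # guard against float drift
    return certain, sampled, k


if __name__ == "__main__":
    print(equal_probability_positions(509, 80, r=3.86)[:8])
    print(pps_systematic([10, 20, 30, 40, 200, 50, 50], 3, r=30))
```

For a noncertainty unit drawn by `pps_systematic`, the inclusion probability is its size divided by the returned `k`, so the weighted value used in the estimators is the observed value multiplied by `k / size`; certainty units keep a weight of 1.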
EXHIBIT 1. MACRO SYSSAMP1 (equal probability systematic selection)

Program header: To invoke MACRO SYSSAMP1 the following user-defined macros are required:
A. MACRO _STRVAR (variable to be used in implicit stratification)
B. MACRO _SAMPSZ (sample size desired)
D. MACRO _DSFRAME (input data set containing the frame)
E. MACRO _DSSAMP (output data set containing the selected sample)
NOTE: Since the universe size (N) is determined by the number of observations in the frame, any observation with a missing value for the auxiliary variable _STRVAR should be deleted before sample selection.

Set-up for the example: _STRVAR is ENRLMNT, _DSFRAME is INA.SCHOOL, _SAMPSZ is 80, and _DSSAMP is OUT1.

Program listing: [SAS source illegible in the transcription. The legible fragments show the frame being sorted by _STRVAR, the interval computed as the universe size divided by the sample size and rounded, the random start generated with the UNIFORM function and rounded, and the selected unit numbers merged back onto the frame by unit number.]

Sample output, EQUAL PROBABILITY SAMPLE SELECTION:
The universe size is: 509
The sample size is: 80
The selection interval is: 6.36
The random start is: 3.86
The listing that follows gives the sequence number (_SAMPSEQ), the original unit number of each sample unit, and the variable used for implicit stratification (ENRLMNT). The first unit numbers are 3, 10, 16, 22, 29, 35, 42, 48. [Remaining rows illegible.]

EXHIBIT 2. MACRO SYSSAMP2 (PPS systematic selection)

Program header: To invoke MACRO SYSSAMP2 the user-defined macros of SYSSAMP1 are required, together with:
F. MACRO _MEASRSZ (measure of size variable to be used in PPS selection)
NOTE: Since the universe size (N) is determined by the number of observations in the frame, any observation with a missing value for the auxiliary variable _STRVAR should be deleted before sample selection.
NOTE: The certainty cutoff is .75 * (selection interval (K)).

Set-up for the example: _SAMPSZ is 80 and _MEASRSZ is ENRLMNT. [Other settings illegible.]

Program listing: [SAS source illegible in the transcription.]

Sample output, PPS SAMPLE SELECTION:
Certainty cases: three units are listed. [Values illegible.]
The non-certainty sample size is: 77
[The number of non-certainty cases in the frame, the selection interval, and the random start are illegible.]
Noncertainty sample selections: the listing gives _SAMPSEQ, ENRLMNT, UNIT_NO, and the cumulative measure of size, where UNIT_NO identifies the line number of the sample unit in the frame. [Rows illegible.]

EXHIBIT 3. MACRO SYSVAR (variance estimation by successive differences)

Program header: To invoke MACRO SYSVAR the following user-defined macros are required:
A. MACRO _DSNAME (input data set containing the sample data)
B. MACRO _VARLIST (list of variables for which variances are to be computed)
C. MACRO _SAMPSEQ (sort variable to be used to order the file as in sample selection)
NOTE: Before using this procedure the data must be properly weighted.

Set-up for the example: _VARLIST is EMPL SQFEET ELECBTU TOTBTU. [Other settings illegible.]

Program listing: [SAS source illegible in the transcription.]

Sample output, SUCCESSIVE DIFFERENCES VARIANCE ESTIMATION PROCEDURE:
Columns: variable label, number of observations, estimated total, variance, coefficient of variation.
Rows: EMPL, SQFEET, ELECBTU, TOTBTU. [Numeric values illegible.]
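
The successive differences estimator of Section 4 is short enough to sketch directly. The Python function below is an illustration under stated assumptions, not the code of MACRO SYSVAR: the function name is hypothetical, the input is assumed to be the weighted values of the noncertainty sample units already sorted in selection order, and the coefficient of variation is reported as a percentage of the estimated total.

```python
import math


def successive_differences(weighted):
    """Return (estimated total, variance estimate, CV in percent).

    `weighted` lists the weighted values y'_i of the noncertainty sample
    units, in the same order in which the units were selected.
    """
    n = len(weighted)
    if n < 2:
        raise ValueError("at least two sample units are needed")
    total = sum(weighted)  # estimated total, the sum of the weighted values
    # Sum of squared differences between neighbouring sample units.
    squared = sum((a - b) ** 2 for a, b in zip(weighted, weighted[1:]))
    variance = n * squared / (2 * (n - 1))
    # Assumption: CV expressed as a percentage; zero when the total is zero.
    cv = 100 * math.sqrt(variance) / total if total else 0.0
    return total, variance, cv


if __name__ == "__main__":
    print(successive_differences([10, 12, 10, 12]))
```

Certainty units add to the estimated total but not to the variance, so in practice their weighted values would be added to the returned total separately.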