“Sample Estimation of Two-Sided Matching Models”
John Allen Logan, Univ. of Wisconsin-Madison
Extended Abstract for RC28 Meeting, May 2014

The two-sided matching models of Logan (1996; AJS 102:114-160) and Logan, Hoff and Newton (“LHN”, 2008; JASA 103(482): 559-569) allow for statistical inference about the preference structures underlying job-worker or marital partner matches observed in populations. In particular, the models provide a behavioral explanation for the patterns seen in mobility tables, and promise to allow counterfactual predictions of the effects of changes in the supply and demand of different types of partners (employers, workers, spouses).

This paper presents tsreg, a new, user-written Stata command for estimating the LHN form of these models using individual-level data on both sides of the market. The tsreg command allows a researcher to process ordinary data files containing, e.g., characteristics of married partners and single men and women. It allows the researcher to specify two-sided preference effects using ordinary Stata syntax, and automatically generates the NxN data matrices used to implement LHN’s Bayesian Markov chain Monte Carlo (MCMC) estimation algorithm. The output consists of sampled values from Markov chain scans, which can be summarized and graphed with included Stata routines to produce Bayesian estimates and intervals for the preference parameters. Here is a tsreg command for one of the estimations to be described later:

    tsreg {x: xraw1[j]==yraw1[i]==1 } {y: yraw1[i]==xraw1[j]==1 },
        estimation xvarnames(xh1) yvarnames(yh1) alfa_vals(0) beta_vals(0)
        xraw(xraw1 xzero) yraw(yraw1 yzero) xflag(woman) yflag(man)
        caseid(match_id) jobname(h7003) blksize(16) nojumble
        scans(20000) dens(100) seed(-1) ;

The paper also extends the LHN analysis in several ways. A new likelihood function for the model represents the probability of the observed matching as a function of preference orderings rather than utility values.
This function is more computationally intensive than LHN’s method, but it was used to program an ML estimator for systems of size N=2, that is, two men and two women. Parallel estimates of parameters using the ML and MCMC algorithms show very close agreement, as in this figure:

[Figure: MCMC scan traces for two parameters, with horizontal reference lines at the corresponding MLEs.]

The horizontal reference lines at about 0.1 and 0.75 are the MLEs for the two parameters, while the wiggly traces are the corresponding MCMC scans. Experimentation with tsreg and the program for size N=2 systems uniformly showed the Bayesian estimates closely tracking the MLEs. This experiment provides a kind of validation of the Bayesian MCMC method (as implemented in tsreg and when using diffuse priors), because the ML approach is more transparent and also non-stochastic. There is no prospect of using ML for larger systems, however.

LHN asserted that their method could identify separate effects of preferences on the two sides of the market. For example, LHN implicitly claimed to be able to determine whether men’s preferences for women’s educations, or women’s for men’s, were responsible for educational homogamy, and to determine the relative strengths of the preferences on the two sides when both operated. Taking advantage of a tsreg simulation option that allows stable matches to be created in synthetic populations for any desired specification of preferences, the present paper investigates this identification question (and others) by performing estimations on data generated from known parameter values. The results support LHN’s claim. Here are results from seven replicates using systems of size N=32:

[Figure: Estimates of Two Sides' Preferences for Same Characteristics in 7 Replications. Mean preference estimates (a1bar, b1bar) in systems of size N=32; dashed lines are parameter values.]

The dashed lines in the figure are the parameter values for the preferences, α1 = 0.5 and β1 = 0.75.
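The simulation step described above, creating a stable matching in a synthetic population from known preference parameters, can be sketched in a few lines of Python. This is only an illustrative sketch and not the tsreg implementation. It assumes equal numbers of actors on the two sides, no singles, one binary characteristic per actor, standard normal utility errors (the probit case), and deferred acceptance (Gale and Shapley) as the way to find a stable matching. Attaching alpha1 to the proposing side is likewise an assumption, and all function and variable names are hypothetical.

```python
import random


def deferred_acceptance(u, v):
    """Proposer-side deferred acceptance; returns match[i] = partner of proposer i.

    u[i][j] is the utility proposer i gets from receiver j;
    v[j][i] is the utility receiver j gets from proposer i.
    """
    n = len(u)
    # Each proposer's list of receivers, most preferred first.
    ranked = [sorted(range(n), key=lambda j: u[i][j], reverse=True) for i in range(n)]
    next_choice = [0] * n      # index of the next receiver each proposer will try
    holder = [None] * n        # holder[j] = proposer currently held by receiver j
    free = list(range(n))
    while free:
        i = free.pop()
        j = ranked[i][next_choice[i]]
        next_choice[i] += 1
        k = holder[j]
        if k is None:
            holder[j] = i                  # receiver j was unmatched
        elif v[j][i] > v[j][k]:
            holder[j] = i                  # j trades up; k becomes free again
            free.append(k)
        else:
            free.append(i)                 # j rejects i
    match = [None] * n
    for j, i in enumerate(holder):
        match[i] = j
    return match


def simulate_stable_matching(n, alpha1, beta1, seed=0):
    """Simulate utilities with same-characteristic effects and return a stable matching."""
    rng = random.Random(seed)
    x = [rng.randint(0, 1) for _ in range(n)]   # hypothetical binary trait, proposing side
    y = [rng.randint(0, 1) for _ in range(n)]   # hypothetical binary trait, receiving side
    u = [[alpha1 * (x[i] == y[j]) + rng.gauss(0, 1) for j in range(n)] for i in range(n)]
    v = [[beta1 * (x[i] == y[j]) + rng.gauss(0, 1) for i in range(n)] for j in range(n)]
    return x, y, u, v, deferred_acceptance(u, v)


# One synthetic system of size N=32 at the parameter values used in the text.
x, y, u, v, match = simulate_stable_matching(n=32, alpha1=0.5, beta1=0.75, seed=1)
same_share = sum(x[i] == y[match[i]] for i in range(32)) / 32
print(f"share of same-characteristic matches: {same_share:.2f}")
```

Data generated this way have known values of α1 and β1, so estimates obtained from them can be compared directly with the truth. Deferred acceptance returns the stable matching most favorable to the proposing side, which is one choice among the stable matchings that may exist.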
More extensive results will be presented, including experiments on the separate identification of same-status preferences (e.g., preferences for a spouse of the same religion), which find that such separate preferences are only weakly identified (if at all), as LHN had suggested. Models with constant terms, continuous explanatory variables, and same-status dummy explanatory variables, in various combinations, have been estimated.

The remaining parts of the paper consider the problem of estimating LHN’s matching model on sample rather than population data. LHN, like other investigators of two-sided models, used samples of data from large populations to demonstrate their estimation method; LHN expressed a hope that using a sample would not bias their results.

The LHN model describes a single probability for an observed matching comprising all the members of a population. This probability cannot be decomposed into independent probabilities for the matches obtained by the various members, since the preferences of the system members all interlock to determine the outcome. Each matching of an entire population constitutes a single case. One kind of sampling that might be considered is sampling complete matched systems from a larger population, as might be imagined for a marriage study conducted over similar, isolated communities. The likelihood function of such a sample would be a product of the independent outcomes in the different populations, and general properties of ML or Bayesian inference would argue for consistency.

The type of sampling that is most relevant to sociological research is different. Matching populations tend to be large, with porous and hard-to-define boundaries. It is desirable to have models that can be estimated on samples from such large matching populations.
Estimation on samples might alleviate the otherwise insurmountable problem of defining the exact boundaries of the population relevant to each particular actor, and could also make the computational burden tolerable. So the problem of sampling from a matched population is key to practical applications of LHN’s model.

When considering sampling from a matched population, that is, one that is in a stable matching as assumed for the LHN method, it may first be observed that any sample of two or more matches (i.e., matched pairs and singletons) from a stable matching must itself be a stable matching when considered in isolation. (This point will be elaborated.) This observation suggests basing estimation on random samples of matches from a population, treating each such sample as a separate, stably matched system. This is the procedure that was used by LHN.

The paper uses two approaches to consider the properties of such estimates based on randomly sampled matches. First, consistency of the estimator is considered analytically for the two-sided logit special case of LHN’s model, using the new likelihood function that was the basis of the ML estimates described earlier. In the logit special case, the likelihood for the stable matching of the sampled subsystem reduces to one based on the sampled data alone, if the uniform conditioning property described by McFadden (1977; cf. Ben-Akiva and Lerman, 1990) applies to the sampling method. A consistency proof for this reduced version of the likelihood seems possible.

The second approach is to simulate large populations that are stably matched with known parameter values, and then examine the estimates obtained by applying the LHN method to random samples of different sizes. Ideally, the simulated stable population would contain a very large (really, infinite) number of matches, but in practice I have limited simulations to systems of size N = 4096 (a total of 8192 actors).
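The observation made earlier, that any sample of matches drawn from a stable matching is itself stable when considered in isolation, can be checked by brute force on a small system. The Python sketch below is illustrative only: it assumes matched pairs without singletons, draws hypothetical utilities at random, enumerates every stable matching of a system of size N=5, and then tests every sub-sample of two or more pairs for blocking pairs. All names are hypothetical.

```python
import random
from itertools import combinations, permutations


def blocking_pairs(u, v, pairs):
    """List pairs (i, j) among the actors in `pairs` who prefer each other to their partners.

    u[i][j] is the utility side-A actor i gets from side-B actor j;
    v[j][i] is the utility side-B actor j gets from side-A actor i.
    Only the actors appearing in `pairs` are considered, so the same
    function works for a full matching and for a sub-sample of it.
    """
    partner_of_a = {i: j for i, j in pairs}
    partner_of_b = {j: i for i, j in pairs}
    return [
        (i, j)
        for i in partner_of_a
        for j in partner_of_b
        if u[i][j] > u[i][partner_of_a[i]] and v[j][i] > v[j][partner_of_b[j]]
    ]


def stable_matchings(u, v):
    """Enumerate all stable matchings by exhaustive search (feasible only for small n)."""
    n = len(u)
    return [
        list(enumerate(perm))
        for perm in permutations(range(n))
        if not blocking_pairs(u, v, list(enumerate(perm)))
    ]


rng = random.Random(2014)
n = 5
u = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(n)]
v = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(n)]

all_subsamples_stable = all(
    not blocking_pairs(u, v, list(sub))
    for pairs in stable_matchings(u, v)
    for size in range(2, n + 1)
    for sub in combinations(pairs, size)
)
print(all_subsamples_stable)  # True
```

The result is True for any utilities, because a pair that blocks the sub-sample keeps the same partners in the full population and would therefore block the full matching as well.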
The next figure shows some results of estimating the parameters of the simulated N = 4096 population using random samples (with uniform conditioning) of sizes 8 through 128. The large, paired dots represent sample means for the two parameters in the model, α1 and β1, which again have population values of 0.5 and 0.75, represented by the dashed lines. The model is the two-sided probit special case.

[Figure: Estimations From Size N=4096 System Using Various Block Sizes. Horizontal axis: estimation block size (N of actors on each side), 8 to 4096. Vertical axis: coefficient value, 0 to 2.5. Series: alpha 1 mean, beta 1 mean. Dashed lines indicate parameter values.]

The figure shows the estimates approaching the parameter values more closely with increasing sample sizes, at least up to N=128. Judging from the decreasing rate of approach, it also seems plausible that the estimates are converging asymptotically toward the parameter values, setting aside the finite population size. That is, these two-sided probit estimates give the appearance of consistency in this simulation. (These particular simulations use same-status explanatory variables rather than continuous ones. For this reason, the magnitudes of the effects on the two sides are not clearly differentiated.) (I have yet to estimate the two-sided logit special case with a corresponding series of samples, but this will be included in the final paper.)

The rather large biases seen in the estimates using very small samples of matches have implications for practical two-sided matching model estimation. The tsreg Stata command includes a feature by which large data sets can be broken randomly into smaller subsamples, and the LHN method applied within the subsamples. The result is much faster estimation than could be obtained by analyzing entire data sets as single, matched populations.
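The random subsampling just described can be illustrated with a short Python sketch. It is not the tsreg implementation: the function name and arguments are hypothetical, and it assumes that each block is handled as its own NxN system, so that the number of matrix cells is a rough proxy for the work involved. How tsreg treats a final block smaller than the requested size is not described in the source; here it is simply kept as a shorter block.

```python
import random


def random_blocks(match_ids, block_size, seed=0):
    """Shuffle the match identifiers and cut them into consecutive blocks."""
    ids = list(match_ids)
    random.Random(seed).shuffle(ids)
    return [ids[k:k + block_size] for k in range(0, len(ids), block_size)]


# A population of 4096 matches split into blocks of 16 matches each.
blocks = random_blocks(range(4096), block_size=16, seed=1)

cells_full = 4096 ** 2                            # one 4096 x 4096 system
cells_blocked = sum(len(b) ** 2 for b in blocks)  # 256 separate 16 x 16 systems
print(len(blocks), cells_full, cells_blocked)     # 256 16777216 65536
```

Under that assumption, blocks of 16 reduce the number of cells by a factor of 256, which gives a sense of why the gain in speed can be substantial.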
The simulation results in the figure above, however, suggest that this gain in speed of estimation (which can be very substantial) needs to be weighed against the apparent biasing effect of small subsamples. More investigation is needed to give a fuller description of the likely size of the bias in realistic situations.

Summary: A new Stata command, tsreg, implementing LHN’s method for two-sided matching model estimation is presented and demonstrated. A new likelihood formulation is presented and used to validate the Bayesian MCMC estimates of tsreg. Stable matched populations are generated with tsreg for given parameter values, and the resulting data sets are used for estimations that show the ability of the model to differentiate effects on the two sides of the market. Estimation using samples of matches from a stably matched system is investigated both analytically and by simulation studies. Analysis using the new likelihood formulation suggests that estimation with the two-sided logit form of the LHN model is consistent. Simulations using samples of different sizes from a large, finite population show the bias of the estimates shrinking as sample size grows, which suggests consistency. Implications are drawn for the appropriate use of subsampled LHN estimation with tsreg.