“Sample Estimation of Two-Sided Matching Models”

John Allen Logan, Univ. of Wisconsin-Madison
Extended Abstract for RC28 Meeting, May, 2014
The two-sided matching models of Logan (1996; AJS 102:114-160) and Logan, Hoff and
Newton (“LHN”, 2008; JASA 103(482): 559-569) allow for statistical inference about the
preference structures underlying job-worker or marital partner matches observed in
populations. In particular, the models provide a behavioral explanation for the patterns
seen in mobility tables, and promise to allow counterfactual predictions of the effects of
changes in supply and demand of different types of partners (employers, workers,
spouses).
This paper presents tsreg, a new, user-written Stata command for estimating the
LHN form of these models using individual-level data on both sides of the market. The
tsreg command allows a researcher to process ordinary data files containing, e.g.,
characteristics of married partners and single men and women. It allows the researcher
to specify two-sided preference effects using ordinary Stata syntax, and automatically
generates the NxN data matrices used to implement LHN’s Bayesian Markov chain
Monte Carlo (MCMC) estimation algorithm. The output is sampled values from Markov
chain scans, which can be summarized and graphed with included Stata routines to
produce Bayesian estimates and intervals for the preference parameters. Here is a
tsreg command for one of the estimations to be described later:
tsreg {x: xraw1[j]==yraw1[i]==1 } {y: yraw1[i]==xraw1[j]==1 },
estimation xvarnames(xh1) yvarnames(yh1) alfa_vals(0) beta_vals(0)
xraw(xraw1 xzero) yraw(yraw1 yzero) xflag(woman) yflag(man) caseid(match_id)
jobname(h7003) blksize(16) nojumble scans(20000) dens(100) seed(-1) ;
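The step from individual-level records to NxN pair-level matrices can be pictured with a small sketch. It is written in Python rather than Stata, purely for illustration, and it is not tsreg's code. The names `y_raw`, `x_raw`, and `pair_indicator` are hypothetical, and reading the preference term as "both partners have the characteristic" is one assumed interpretation of the syntax above.

```python
# Hypothetical individual-level data: one 0/1 characteristic per person.
y_raw = [1, 0, 1, 0]   # men, indexed by i
x_raw = [1, 1, 0, 0]   # women, indexed by j


def pair_indicator(y_raw, x_raw):
    """Build an N x N matrix with one cell per potential (i, j) pair.

    Cell (i, j) is 1 when man i and woman j both have the characteristic
    (value 1). This reading of the preference term is an assumption.
    """
    return [[int(y == 1 and x == 1) for x in x_raw] for y in y_raw]


pair = pair_indicator(y_raw, x_raw)
for row in pair:
    print(row)
```

Each row holds one man's covariate values across all women, so a file of N couples expands to N x N cells per explanatory variable. This growth is one reason estimation on very large systems becomes burdensome.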
The paper also extends the LHN analysis in several ways. A new likelihood function for
the model represents the probability of the observed matching as a function of
preference orderings rather than utility values. This function is more computationally
intensive than LHN’s method, but was used to program an ML estimator for systems of
size N=2, that is, two men and two women. Parallel estimates of parameters using the
ML and MCMC algorithms show very close agreement, as in this figure:
The horizontal reference lines at about 0.1 and 0.75 are the MLEs for the two
parameters, while the wiggly traces are corresponding MCMC scans. Experimentation
with tsreg and the program for size N=2 systems uniformly showed the Bayesian
estimates closely tracking the MLEs. This experiment provides a kind of validation of the
Bayesian MCMC method (as implemented in tsreg and when using diffuse priors)
because the ML approach is more transparent, and also non-stochastic. There is no
prospect of using ML for larger systems, however.
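As an illustration of how saved scan values might be reduced to a Bayesian point estimate and interval, and then set against a reference value such as an MLE, here is a minimal Python sketch. It does not reproduce the Stata summary routines shipped with tsreg. The function name `summarize_scans`, the burn-in length, and the independent normal draws standing in for a real chain are all assumptions made for the example.

```python
import random
import statistics


def summarize_scans(draws, burn_in=0, level=0.95):
    """Return (posterior mean, lower, upper) from saved MCMC draws.

    The interval is an equal-tailed percentile interval computed by
    nearest rank after discarding the first `burn_in` draws.
    """
    kept = sorted(draws[burn_in:])
    tail = (1.0 - level) / 2.0
    last = len(kept) - 1
    lower = kept[round(tail * last)]
    upper = kept[round((1.0 - tail) * last)]
    return statistics.mean(kept), lower, upper


# Stand-in for a saved chain: independent draws around 0.75 (assumption;
# a real chain would be autocorrelated and read from the tsreg output).
rng = random.Random(42)
draws = [rng.gauss(0.75, 0.05) for _ in range(2000)]

reference = 0.75  # hypothetical reference value standing in for an MLE
mean, lower, upper = summarize_scans(draws, burn_in=100)
print(f"posterior mean {mean:.3f}, 95% interval [{lower:.3f}, {upper:.3f}]")
print("reference inside interval:", lower <= reference <= upper)
```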
LHN asserted that their method could identify separate effects of preferences on the two
sides of the market. For example, LHN implicitly claimed to be able to determine
whether men’s preferences for women’s educations, or women’s for men’s, were
responsible for educational homogamy, and to determine the relative strengths of
preferences on two sides when both preferences operated. Taking advantage of a tsreg
simulation option allowing stable matches to be created in synthetic populations for any
desired specification of preferences, the present paper investigates this identification
question (and others) by performing estimations on data generated from known
parameter values. The results support LHN’s claim. Here are results from seven
replicates using systems of size N=32:
[Figure: Estimates of Two Sides' Preferences for Same Characteristics in 7 Replications. Mean preference estimates (a1bar and b1bar) in systems of size N=32; dashed lines are parameter values.]
The dashed lines in the figure are the parameter values for the preferences, α1 = 0.5
and β1 = 0.75. More extensive results will be presented, including experiments
investigating the separate identification of same-status preferences (e.g., preferences
for a spouse of the same religion) which find such separate preferences are only weakly
identified (if at all), as LHN had suggested. Models with constant terms, continuous
explanatory variables, and same-status dummy explanatory variables, in various
combinations, have been estimated.
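One way to generate a stably matched synthetic population from known parameter values is sketched below in Python. It is not tsreg's simulation option, whose algorithm is not described here. The sketch assumes equal numbers on the two sides, no option of remaining single, one same-status dummy on each side, independent standard normal errors (a probit-style specification), and the men-proposing deferred acceptance algorithm of Gale and Shapley. The function and variable names are hypothetical, as is the assignment of alpha1 to the men's side and beta1 to the women's.

```python
import random


def simulate_stable_matching(n, alpha1, beta1, rng):
    """Simulate n men and n women and return a stable matching.

    u[i][j] is man i's utility for woman j; v[j][i] is woman j's utility
    for man i. Both depend on a same-status dummy plus normal noise.
    """
    men = [rng.randint(0, 1) for _ in range(n)]
    women = [rng.randint(0, 1) for _ in range(n)]
    u = [[alpha1 * (men[i] == women[j]) + rng.gauss(0, 1)
          for j in range(n)] for i in range(n)]
    v = [[beta1 * (men[i] == women[j]) + rng.gauss(0, 1)
          for i in range(n)] for j in range(n)]

    # Each man's list of women, most preferred first.
    prefs = [sorted(range(n), key=lambda j: u[i][j], reverse=True)
             for i in range(n)]
    next_choice = [0] * n      # position of each man's next proposal
    husband = [None] * n       # current partner of each woman
    free = list(range(n))      # men not currently matched

    while free:
        i = free.pop()
        j = prefs[i][next_choice[i]]
        next_choice[i] += 1
        current = husband[j]
        if current is None:
            husband[j] = i
        elif v[j][i] > v[j][current]:
            husband[j] = i          # woman j trades up
            free.append(current)
        else:
            free.append(i)          # proposal rejected

    wife = [None] * n
    for j, i in enumerate(husband):
        wife[i] = j
    return men, women, u, v, wife


rng = random.Random(1)
men, women, u, v, wife = simulate_stable_matching(32, 0.5, 0.75, rng)
same = sum(men[i] == women[wife[i]] for i in range(32))
print("same-status couples:", same, "of 32")
```

Deferred acceptance yields the matching most favorable to the proposing side when several stable matchings exist, so a different selection rule could give a different simulated data set.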
The remaining parts of the paper consider the problem of estimating LHN’s matching
model on sample rather than population data. LHN, like other investigators of two-sided
models, used samples of data from large populations to demonstrate their estimation
method; LHN expressed a hope that using a sample would not bias their results.
The LHN model describes a single probability for an observed matching comprising all
the members of a population. These probabilities cannot be decomposed into
independent probabilities for the matches obtained by the various members, since the
preferences of the system members all interlock to determine the outcome. Each
matching of an entire population constitutes a single case. One kind of sampling that
might be considered is sampling complete matched systems from a larger population,
as might be imagined for a marriage study conducted over similar, isolated
communities. The likelihood function of such a sample would be a product of the
independent outcomes in the different populations, and general properties of ML or
Bayesian inference would argue for consistency.
The type of sampling that is most relevant to sociological research is different. Matching
populations tend to be large, with porous and hard-to-define boundaries. It is desirable
to have models that can be estimated on samples from such large matching
populations. This might sidestep the otherwise insurmountable problem of defining the exact
boundaries of the population relevant to each particular actor, and could also make the
computational burden tolerable. So the problem of sampling from a matched population
is key to practical applications of LHN’s model.
When considering sampling from a matched population, that is, one that is in a stable
matching as assumed for the LHN method, it may first be observed that any sample of
two or more matches (i.e., matched pairs and singletons) from a stable matching must
itself be a stable matching when considered in isolation. (This point will be elaborated.)
This observation suggests basing estimation on random samples of matches from a
population, treating each such sample as a separate, stably matched system. This is
the procedure that was used by LHN.
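The observation that a sample of matches inherits stability can be checked numerically on a small system. The reasoning is that a pair who would block the sampled subsystem would also block the full matching, which is ruled out by assumption. The Python sketch below is a minimal check under simplifying assumptions: matched pairs only (no singletons), randomly drawn utilities, and a stable matching found by brute-force enumeration, which is feasible only for very small N. The function names are hypothetical.

```python
import itertools
import random


def blocking_pairs(pairs, u, v):
    """Return the (man, woman) pairs who prefer each other to their partners.

    Only the people appearing in `pairs` are considered, so passing a
    subset of a matching treats that subset as an isolated system.
    """
    wife = dict(pairs)
    husband = {w: m for m, w in pairs}
    return [(m, w) for m in wife for w in husband
            if w != wife[m]
            and u[m][w] > u[m][wife[m]]
            and v[w][m] > v[w][husband[w]]]


def find_stable_matching(u, v):
    """Search all complete matchings for one with no blocking pair."""
    n = len(u)
    for perm in itertools.permutations(range(n)):
        pairs = list(enumerate(perm))
        if not blocking_pairs(pairs, u, v):
            return pairs
    raise ValueError("no stable matching found")


rng = random.Random(0)
n = 5
u = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(n)]  # men's utilities
v = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(n)]  # women's utilities

full = find_stable_matching(u, v)

# Every subsample of two or more matched pairs, taken in isolation.
all_subsamples_stable = all(
    not blocking_pairs(list(sub), u, v)
    for k in range(2, n + 1)
    for sub in itertools.combinations(full, k)
)
print("all subsamples stable:", all_subsamples_stable)
```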
The paper uses two approaches to consider the properties of such estimates based on
randomly sampled matches. First, consistency of the estimator is considered
analytically for the two-sided logit special case of LHN’s model, using the new likelihood
function that was the basis of the ML estimates described earlier. In the logit special
case the likelihood for the stable matching of the sampled subsystem reduces to one
based on the sampled data alone, if the uniform conditioning property described by
McFadden (1977; cf. Ben-Akiva and Lerman, 1990) applies to the sampling method. A
consistency proof for this reduced version of the likelihood seems possible.
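The role of uniform conditioning can be seen in the ordinary one-sided conditional logit with sampled alternatives, which is the setting of McFadden's result. The likelihood contribution for a chosen alternative carries a correction term, the log of the probability of drawing the sampled set given the choice. When that probability is the same for every alternative in the set, the correction cancels and only the sampled data remain. The Python sketch below illustrates that cancellation numerically. It is not the two-sided likelihood discussed in the paper, and the utilities and sampling probabilities are made-up values.

```python
import math


def sampled_logit_prob(chosen, sampled_set, V, pi_given):
    """Conditional logit probability with a sampling correction.

    pi_given[j] is the probability of drawing `sampled_set` given that
    alternative j was the one chosen.
    """
    num = math.exp(V[chosen] + math.log(pi_given[chosen]))
    den = sum(math.exp(V[j] + math.log(pi_given[j])) for j in sampled_set)
    return num / den


def plain_logit_prob(chosen, sampled_set, V):
    """Logit probability computed on the sampled alternatives alone."""
    den = sum(math.exp(V[j]) for j in sampled_set)
    return math.exp(V[chosen]) / den


V = {0: 0.2, 1: -0.5, 2: 1.0, 3: 0.0}   # hypothetical systematic utilities
D = [0, 2, 3]                            # sampled set containing the choice

uniform = {j: 0.25 for j in D}           # uniform conditioning
skewed = {0: 0.1, 2: 0.6, 3: 0.3}        # sampling depends on the choice

print(sampled_logit_prob(2, D, V, uniform), plain_logit_prob(2, D, V))
print(sampled_logit_prob(2, D, V, skewed), plain_logit_prob(2, D, V))
```

Under the uniform probabilities the two values agree; under the skewed probabilities they differ, which is why the sampling method matters for the reduction.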
The second approach is to simulate large populations that are stably matched with
known parameter values, and then examine estimates obtained from applying the LHN
method to random samples of different sizes. Ideally, the simulated stable population
would contain a very large (really, infinite) number of matches, but in practice I have
limited simulations to systems of size N = 4096 (a total of 8192 actors). The next figure
shows some results of estimating the parameters of this population using random
samples (with uniform conditioning) of sizes 8 through 128. The large, paired dots
represent sample means for the two parameters in the model, α1 and β1, which again
have population values of 0.5 and 0.75, represented by the dashed lines. The model is
the two-sided probit special case.
[Figure: Estimations From Size N=4096 System Using Various Block Sizes. Vertical axis: coefficient value (0 to 2.5). Horizontal axis: estimation block size (N of actors on each side), with ticks from 8 to 4096. Series: alpha 1 mean and beta 1 mean. Dashed lines indicate parameter values.]
The figure shows the estimates approaching the parameter values more closely with
increasing sample sizes, at least up until N=128. It also seems plausible, judging from
the decreasing rate of approach, that the estimates are converging asymptotically on the
parameter values, if the finite population size is ignored. That is, these two-sided probit estimates
give the appearance of consistency in this simulation. (These particular simulations use
same-status explanatory variables rather than continuous ones. For this reason, the
magnitudes of the effects on the two sides are not clearly differentiated.)
(I have yet to estimate the two-sided logit special case with a corresponding series of
samples, but this will be included in the final paper.)
The rather large biases seen in the estimates using very small samples of matches
have implications for practical two-sided matching model estimation. The tsreg Stata
command includes a feature by which large data sets can be broken randomly into
smaller subsamples, and the LHN method applied within the subsamples. The result is
much faster estimation than could be obtained by analyzing entire data sets as single,
matched populations. The simulations shown here, however, suggest that this gain in
speed of estimation (which can be very substantial) needs to be weighed against the
apparent biasing effect of small subsamples. More investigations need to be done to
give a fuller description of the likely size of the bias in realistic situations.
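The random subsampling step can be sketched in a few lines of Python. This illustrates the general idea only, not tsreg's implementation. The function name `random_blocks` is hypothetical, `blksize` is borrowed from the option name in the command shown earlier, and letting the last block be smaller than the rest is an assumption about how a remainder might be handled.

```python
import random


def random_blocks(match_ids, blksize, seed=None):
    """Shuffle match identifiers and cut them into blocks of `blksize`.

    Each block would then be estimated as its own stably matched system.
    The final block may hold fewer than `blksize` matches.
    """
    rng = random.Random(seed)
    ids = list(match_ids)
    rng.shuffle(ids)
    return [ids[k:k + blksize] for k in range(0, len(ids), blksize)]


blocks = random_blocks(range(100), blksize=16, seed=1)
print([len(b) for b in blocks])
```

Smaller blocks shrink the pair-level matrices handled in each estimation, which is the source of the speed gain, while the simulations above suggest that very small blocks carry more bias.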
Summary: A new Stata command, tsreg, implementing LHN’s method for two-sided
matching model estimation is presented and demonstrated. A new likelihood formulation
is presented and used to validate the Bayesian MCMC estimates of tsreg. Stable
matched populations are generated with tsreg for given parameter values, and the
resulting data sets are used for estimations that show the ability of the model to
differentiate effects on the two sides of the market. Estimation using samples of
matches from a stably matched system is investigated both analytically and by
simulation studies. Analysis using the new likelihood formulation suggests estimation
with the two-sided logit form of the LHN model is consistent. Simulations show changes
in the bias of estimates that suggest consistency, using samples of different sizes from a
large, finite population. Implications are drawn for the appropriate use of subsampled
LHN estimation with tsreg.