Three Threats to Validity of
Choice-Based and Matched Sample Studies
in Accounting Research*
Donald P. Cram*
Vijay Karan**
and
Iris Stuart***
November 25, 2007
*Corresponding author. Los Angeles, California, USA.
**Department of Accounting, College of Business and Economics, California State University, Fullerton,
California, USA.
***Department of Accounting, Auditing and Law, Norwegian School of Economics and Business
Administration, Bergen, Norway.
Abstract
We consider three technical errors in the statistical analysis of choice-based and matched sample
studies in accounting research. These problems constitute threats to both internal and external
validity of the research. First, we note that researchers have often failed to control for the effects
of matching variables used in sample selection. Commonly, researchers believe that the selection
of a matched sample already controls for the matching variables, and hence controlling for them
in analyses would not be necessary, but in fact it is. Typically an unconditional analysis is
performed, rather than the conditional one that is justified. Thus, failure to account for industry,
size, and other matching variables may have driven incorrect findings in many research studies,
or may have suppressed results waiting to be revealed. Second, where matching is by “closest”
size or other continuous measure, the matching is imperfect, and there remains the possibility
that case vs. control differences in this matching variable could be the cause of differences in
outcome, so researchers must evaluate that possibility and perhaps control for it. Third, the
disproportionate sampling for different population strata that is implicit in the choice-based and
matched sample selection would usually necessitate weighting data in statistical analyses by the
sampling rates in each stratum, but reweighting or other appropriate adjustment to the analysis is
often not implemented. A “logit exemption” to the need for reweighting has been noted in the
literature, but has been used in settings where it does not apply. We provide a simulation
example to demonstrate problems, and provide suggestions for more precise ways to analyze
choice-based and matched samples.
Keywords: Choice-based, matched samples, research designs, research methodology
1. Introduction
There is a long history in accounting research of employing a research design that
involves matched samples or choice-based samples. These approaches are typically employed to
limit data collection costs (e.g. Geiger and Rama 2003, Heninger 2001) or to deal with
nonlinearities that are not well specified (e.g. Kothari, Leone, and Wasley 2005). For example,
Bartov, Gul, and Tsui (2001) identified 173 Compustat firms with qualified opinions and fully
matched the firms by year, 2-digit SIC, and Big 6 or non-Big-6 auditor to a set of control firms.
The authors used logit regression to identify determinants of audit opinion qualifications. This is
just one of many examples.¹ However, we find numerous examples of incorrect statistical
analysis of matched samples and choice based samples in accounting research going back over
many years and spread across all accounting journals. These incorrect methods of analysis have
gained wide acceptance within accounting research in spite of misgivings on the part of some
researchers (see Maddala 1996, Smith 2003). These technical errors in statistical analysis can
cause the researcher to either reject the null hypothesis when the null is true or fail to reject the
null hypothesis when the null is false. The purpose of this study is to describe three technical
errors that can occur due to incorrect analysis of matched samples or choice based samples, show
how they can affect statistical inferences, and recommend simple corrections.
This paper makes several contributions. First, we identify six research designs that we
discern among choice based and matched sample studies in accounting research. These
categories are defined by the manner in which the researcher chooses the sample of firms to
study, the “treatment” group, and then selects the comparison sample, the “control” group.
Clearly defining the six categories enables us to identify and explain the correct analysis that
should be used with each design.
Second, we provide guidance for correct analysis and use of univariate, ordinary least
squares regression, and logit models with choice-based and matched samples. We prove that,
while maximum likelihood estimation generally requires reweighting in a choice-based or
matched sample design, there exists an exception for such reweighting in the special case of
pair-matched logit regression. We also provide a proof for a pair-matched logit regression that
asymptotically correct estimates can be derived by either (a) fully saturating the regression model,
that is, using a dummy variable for each pairing, or (b) using a no-intercept logit regression upon
pair-wise differences.
1 Early examples of choice-based matched sample studies include Beaver (1966), Altman (1968) and Deakin
(1972). In excess of 300 papers using such designs in financial distress prediction have appeared since then, and
the approaches are used widely in other accounting research areas as well. We identified 73 such papers in the
area of audit research alone from 1990 to 2003. This audit research paper listing, with our assessment of
probable Errors 1, 2, and 3 occurrences, is available upon request.
Third, we describe how incorrect methods of analysis of matched samples or choice
based samples have gained common acceptance over many years. We identify three technical
errors found in numerous studies in accounting research and document their frequency in
auditing research during the years 1980 to 2003: Error 1, use of unconditional analysis when
analysis conditional upon effects of matching variables is needed; Error 2, failure to control for
the effect of imperfectly matched variables; and Error 3, failure to reweight observations according
to differing sampling rates.
Fourth, we demonstrate with simulated data how incorrect analysis, that is, analysis which
fails to recognize that matched samples and/or choice-based samples are not random samples,
can lead to incorrect inferences. The simulations tangibly demonstrate that incorrect analysis
may (a) fail to detect significant true effects (Type II error), (b) find false significant effects
(Type I Error), and (c) find significant results that are opposite in sign to the true effects.
Last, we demonstrate with a replication and reanalysis of a published paper, Ghicas
(1990), differences in the results obtained by correct and incorrect analysis of a matched sample.
The correct analysis, unlike the original analysis in the paper, finds support for a
key hypothesis and helps resolve an anomalous result the author had remarked upon.
The rest of the paper is organized as follows. In Section 2 we discuss basic terms,
describe six research design categories in matched samples or choice-based samples, and
discuss the three technical errors due to the use of incorrect statistical analysis. In Section 3, we
demonstrate with the use of simulations the effects of Errors 1 and 2 on reported results. Section
4 includes a replication of a matched sample study that shows changes in results. In Section 5,
we describe the correct statistical analysis to apply within each of six distinct research categories.
Finally, we summarize and conclude.
2. Research Designs and Three Errors
A choice-based sample is a non-random sample where cases having one outcome (e.g.,
firms “choosing” to file for bankruptcy or firms receiving a qualified audit opinion) are
identified, and then comparison samples of control observations are selected from available data
having different outcomes. The analysis then uses the outcome as the dependent variable to be
explained by other variables. Choice-based sampling is particularly useful when data collection
is costly and one category of the outcome to be explained is rare, so random sampling from the
population would not yield very many observations of the rare type unless very costly large
samples were collected (Zmijewski 1984).
Matched sample research is another form of non-random sampling that is intuitively
appealing and widely used in accounting research. Matched samples are those having each
member matched with a corresponding member or members in the other sample or samples, with
matching by characteristics not of immediate research interest. One type of matched sample
research is a “within-subject” study, often used in behavioral and experimental settings in
psychology and medical research, that compares repeated measures for each subject taken before
and after alternative treatments. In accounting, a within-subject matched pair might be two
firm-year observations, for the same firm, before and after an event of interest. Examples are studies
of audit fee level changes (e.g. Iyer and Iyer 1996 and Maher et al. 1992). In accounting usage
these studies are known as “changes” studies; they are correctly analyzed by comparing pair-wise
differences to pair-wise differences, an approach that explicitly takes into account their
pair-wise matching.²
2 In accounting research terminology, “changes” studies are contrasted with “levels” studies. “Changes” studies
can be less prone to omitted variables problems. An example “levels” study would be to analyze the 4,000 or so
firm-year observations available in Compustat for one year, to explain audit fee levels as a function of various
firm characteristics. A study of the somewhat fewer firms for which data is available in each of two years can be
analyzed as a “changes” study: one regresses firm-specific difference in audit fee upon the year-to-year
differences in the various characteristics. In the latter case, one has pair-matched the firm at year 1 with the
same-named firm at year 2. Assuming that firms do not change from year to year in the unmeasured
firm-specific characteristics, the firm-specific characteristics are successfully “differenced out” and should have no
bearing on the “changes” analysis.
A second type of matched sample study is a between-subjects study. For example, Mack
et al. (1976) studied women in a residential retirement community population to measure
potential risk factors for a type of cancer. For each woman diagnosed with the cancer a matched
set of four women of same age, marital status, and similar entry date into the retirement
community was chosen, then detailed personal and family histories were painstakingly collected,
coded, and analyzed. Differences in discrete outcome (e.g. cancer detected or not) were to be
explained by differences in measured factors of interest (e.g. prior exposure to various drugs, use
of hormonal treatments), with additional factors, say for presence of cancer in family history,
also measured and controlled for by inclusion in the model. In accounting, matched pairs might
be firm-year observations of two different firms, chosen so that the firms match in some
characteristics such as industry code and asset size.
A matched sample may or may not also be a choice-based sample; to be both, the match
selection must focus on drawing comparison sets of subjects that have opposite outcomes of the
variable that is to be explained in the analysis but that are similar on matching variables. A
sample of litigated firms paired to non-litigated firms, with matching by industry and closest
size, and to be analyzed by a logit model explaining litigation, is both (e.g. Lys and Watts 1994).
A matched sample may reflect opposite “choices” taken, but its analysis is not in the form
of a choice-based statistical analysis if the choice is not the dependent variable in the analysis.
For example, Wallace (1997) collected pairs of firms matched by industry and size, but which
made opposite decisions on whether or not to adopt residual income-based compensation plans.
In what we term a non-choice-based analysis, he ran OLS regressions of financial performance
upon a decision indicator and other variables.
We use the term “fully-matched” samples to distinguish situations in which each stratum
or case-control comparison subset is unique, and “semi-matched” samples which have strata or
pairings of case and controls that are nominally but not meaningfully unique. For an example of
semi-matching, Heninger (2001) obtained 67 cases of firms whose auditors were sued, and
identified firm-year control observations by matching on year and industry and then randomly
selecting one from the available candidates. There were multiple occurrences of auditor litigation
in some industries, so the nominally 1-1 matched pairs in those industries can be combined into
fewer than 67 matched sets in analysis, and we classify the sample as semi-matched. If instead,
at the last step in matching, he had chosen the unique firm closest in size, or if all 67 cases were
in different industries, then each pairing would be distinctly defined and we would deem his
sample to be fully-matched. Different approaches to analysis are available for semi-matched
samples: for example, in a regression analysis one might include an intercept dummy variable for
each nominally matched pair, or include only one for each meaningfully distinct matched set.
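The distinction between nominal pairs and meaningfully distinct matched sets can be sketched as a simple grouping step. This is only an illustration of the bookkeeping: the pair records, years, and industry codes below are hypothetical and are not data from Heninger (2001), and the function name is ours.

```python
from collections import defaultdict

# Hypothetical nominal pairs: (pair_id, year, two-digit industry code).
# Controls were drawn at random within year and industry, so pairs that
# share a (year, industry) key are not meaningfully distinct strata.
nominal_pairs = [
    (1, 1992, "28"),
    (2, 1992, "28"),
    (3, 1992, "73"),
    (4, 1993, "28"),
    (5, 1993, "28"),
]


def distinct_matched_sets(pairs):
    """Group nominal pair ids by the (year, industry) key used in matching."""
    sets = defaultdict(list)
    for pair_id, year, industry in pairs:
        sets[(year, industry)].append(pair_id)
    return dict(sets)


matched_sets = distinct_matched_sets(nominal_pairs)
n_pair_dummies = len(nominal_pairs)  # one intercept dummy per nominal pair
n_set_dummies = len(matched_sets)    # one intercept dummy per distinct matched set
print(n_pair_dummies, n_set_dummies, matched_sets)
```

In this hypothetical sample, five nominal pairs collapse into three matched sets, so the two specifications mentioned above would differ in the number of intercept dummies they include.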
Choice-based and matched samples are not random samples. Therefore, it is necessary to
perform statistical analysis upon them differently than would be appropriate for random samples,
in order to generate results that should generalize to the larger populations from which they are
selected. Campbell and Stanley (1966) categorize internal and external threats to validity of
research designs. They define internal validity as “the basic minimum without which any
experiment is uninterpretable”. External validity asks the question of generalizability: To what
populations, settings, treatment variables, and measurement variables can an observed effect be
generalized? We identify three ways in which accounting researchers have sometimes failed to
account for the non-randomness of choice-based and matched sample selection in their analysis of
such samples.
Three common errors in accounting research using matched and choice-based samples
We identify six categories of choice-based and matched sample designs that
we discern in accounting research, for which we determine that differing guidance is in fact
needed. We map these out in a Venn diagram in Figure 1, where large overlapping ovals indicate
choice-based samples and matched samples. Some studies are choice-based but nonmatched
(indicated as CB-NM). These, like Palepu (1986), have sample selection based on an observed
outcome variable that is to be explained, but selection within each outcome category is random.
Within matched samples, it is necessary to differentiate between semi-matched and fully-matched
samples. So, choice-based papers using matching may be fully matched (CB-FM,
having unique pairings), or may be semi-matched (CB-SM, having some groups larger than
pairs). Within matched samples that are not choice-based, there is the same fully-matched vs.
semi-matched distinction, defining NCB-FM and NCB-SM types.³ Finally, within NCB-FM we
must distinguish between within-subjects and between-subjects studies (denoted NCB-FM-W
and NCB-FM-B). An example study in each of these six design categories is listed below the
Venn diagram, as well as the count of how many of each category we found in our review of
audit research studies.
3 An NCB-SM example is Krishnan (2003), who created a nonrandom sample by identifying 15,342 firm-year
observations in Big 6 audited firms, selecting only those for which corresponding non-Big 6 control observations
in the same 2-digit SIC and cash flow deciles were available, and then added those 3,316 corresponding non-Big
6 observations.
Accounting researchers argue convincingly for the importance of controlling for the industry,
size, and other variables that they use in match selection (often citing prior research), but then
fail to include the variables in their analysis, as might be done by including an intercept dummy
variable for each matched set. It is well accepted that the likelihood or levels of bankruptcy,
litigation, audit fees, and other common dependent variables will vary by industry and firm size;
in other words one has conditional information for inferring the outcome from knowing the
industry and firm size. Also, the accounting ratios and other independent variables commonly of
interest vary systematically by industry and firm size. In an industry- and size-matched
sample, therefore, the vector of intercept dummies that would control for industry and size is
correlated with both the outcome variable and with the explanatory variables. In ordinary least
squares regression analysis, omission of such a correlated variable causes an omitted correlated
variables problem, rendering coefficient estimates biased and inconsistent. In logit and probit
regressions, it is arguably worse: coefficients on included variables will be estimated
inconsistently even if the omitted variables are uncorrelated with the included variables.⁴
4 We thank an anonymous reviewer for clarifying this point regarding nonlinear analyses.
Researchers have believed that selecting the control sample using matching will by itself
ensure that results have been controlled for the effects of the matching variables, and that the
matching variables therefore need not be included in the statistical analysis. However, as has
been prominently noted in guidance provided in the biomedical field, “the matching process
requires that the data … be analyzed with the matching taken into account” (Breslow and Day
1980, p. 32). To account for the matching within any regression model estimated in a matched
sample, one must fully saturate the model, i.e., one must either include a dummy variable for each
stratum or perform analysis on differenced data within each stratum. For fully-matched paired
designs analyzed by Ordinary Least Squares (OLS) regression, that means adding a dummy
variable for each pairing, or, equivalently, performing analysis on differenced data, i.e., by
regression of pair-wise difference in outcome upon pair-wise differences in independent
variables. For matched designs to be analyzed by logit regression, when the outcome is a 0-1
categorical variable, accounting for the matching is similar although there is a complication, to
be explained, that motivates use of specialized software for correct implementation.⁵ In
accounting research, however, estimated models commonly omit the use of the match selection
information, effectively omitting the effect of matching variables on the dependent variable.⁶
We term this omission Error 1.
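The OLS case described above can be illustrated with a small simulation. The sketch below is not the authors' simulation: the data-generating process (a pair-level effect, standing in for industry and size, that shifts both the regressor and the outcome), the parameter values, and the helper names are all assumptions chosen for illustration. It compares a pooled regression that ignores the pairing with a no-intercept regression on pair-wise differences.

```python
import random

random.seed(0)
TRUE_BETA = 0.5   # assumed true within-pair effect of x on y
N_PAIRS = 2000    # assumed number of matched pairs


def slope_with_intercept(x, y):
    """OLS slope of y on x when an intercept is included."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def slope_through_origin(x, y):
    """OLS slope of y on x with no intercept (used on pair-wise differences)."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


x_all, y_all, dx, dy = [], [], [], []
for _ in range(N_PAIRS):
    pair_effect = random.gauss(0.0, 2.0)  # shared by both members of the pair
    members = []
    for _member in range(2):
        x = pair_effect + random.gauss(0.0, 1.0)  # regressor varies with the stratum
        y = pair_effect + TRUE_BETA * x + random.gauss(0.0, 1.0)
        members.append((x, y))
    (x1, y1), (x2, y2) = members
    x_all += [x1, x2]
    y_all += [y1, y2]
    dx.append(x1 - x2)  # differencing removes the shared pair effect
    dy.append(y1 - y2)

pooled_slope = slope_with_intercept(x_all, y_all)  # unconditional: ignores matching
differenced_slope = slope_through_origin(dx, dy)   # conditional on each pair
print(f"true {TRUE_BETA:.2f}  pooled {pooled_slope:.2f}  differenced {differenced_slope:.2f}")
```

Under these assumptions the pooled slope absorbs the omitted pair effect, while the differenced slope recovers a value close to the assumed true coefficient. The direction and size of the pooled bias depend entirely on how the pair effect is assumed to relate to the regressor and the outcome.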
Error 1: Use of Unconditional Analysis, when Analysis Conditional upon Effects of Matching Variables is Needed
Introductory guidance for non-random sample research designs is found in introductory
statistics textbooks such as Johnson and Bhattacharyya (1985) and Rice (1995). These introduce
the idea that for univariate comparisons in experimental settings where there is a natural pairing
in the data, a matched sample t-test (a one sample test) is more powerful than an unmatched (two
sample) t-test in detecting a mean difference in a given measure. Researchers may have
misperceived that either is appropriate; we must note that an Error 1 has occurred, however,
when a matched pair t-test is required but a two sample, unmatched t-test is performed instead.
The first technical problem (Error 1), that is the use of unmatched analysis for matched samples,
we observe first in the seminal papers by Beaver (1966 and 1968)⁷ and Altman (1968) on
bankruptcy prediction. The use of unmatched analyses has persisted in the literature ever since.
Deakin (1972, p. 172), citing statistician Tatsuoka (1971), noted that Altman (1968)’s
discriminant analysis of a pair-matched sample would have required "more complex
procedures" to be correct, but only one subsequent paper in accounting research implemented
those procedures.⁸ All other discriminant analyses of matched pair samples in accounting and
finance appear to have been performed incorrectly in that they failed to take into account the
pairings. Maddala (1998), discussed below, entirely dismisses the bankruptcy prediction studies
that used matched samples. We develop here that choice-based and matched sample studies,
analyzed correctly in ways we describe fully, can usefully find conditional effects.
5 See Appendix I for development.
6 In our audit research review, we find 67 papers that employed matching, with 55 of those failing to control for it in
their analyses.
7 Beaver (1966) applied both matched and unmatched analysis; he stated preference for the unmatched analysis as
the output seemed to provide unconditional probabilities (that would be justified in a random sample) and he
reported only unmatched analysis in his 1968 paper.
8 A well-executed study by Harrison (1977) is the only matched pair discriminant analysis which we can identify
that addresses the concerns to which Deakin alluded. It is a CB-FM study employing pair-matching of firms
having an accounting method change to controls that did not change accounting method, with matching by
industry, year, and risk measured by beta. Harrison also references Tatsuoka (1971) in his application of a
Hotelling T² test (a multivariate generalization of a matched-pair t-test) to determine whether the single sample
of pairwise differences in market returns differed from zero. This avoided Error 1 by use of a one sample
(differences) test. Essentially he examined whether pair-wise differences were significantly different from zero.
His work is subject to Error 2, however, in that his results could have been driven by residual differences in his
matching variables, although he did perform one sensitivity analysis to attempt to address that. Further, his work
Schlesselman (1982) comments on unmatched analysis of matched data that “If cases and
controls have been matched on a variable that is associated with the study exposure, then an
analysis that does not account for the matching will result in an estimate of the odds ratio that is
biased toward unity” (p.272). Our simulations show that the impact of Error 1 is worse: bias can
go in any direction.
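The attenuation that Schlesselman describes can be illustrated with a small simulation of 1:1 exact matching on a stratum. This is a sketch under stated assumptions, not the authors' simulation: the population design, parameter values, and function names are ours. It compares an ordinary (unconditional) logit on the pooled matched sample with a no-intercept logit on case-minus-control differences, which is the form the conditional likelihood takes for matched pairs.

```python
import math
import random

random.seed(1)
TRUE_BETA = 1.0                            # assumed log odds ratio for exposure x
N_STRATA, FIRMS_PER_STRATUM = 40, 1500     # assumed population layout


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logit(x, y, n_iter=30):
    """Newton-Raphson for logit P(y=1) = a + b*x; returns (a, b)."""
    a = b = 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = sigmoid(a + b * xi)
            r, w = yi - p, p * (1.0 - p)
            g0, g1 = g0 + r, g1 + r * xi
            h00, h01, h11 = h00 + w, h01 + w * xi, h11 + w * xi * xi
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


def fit_pair_difference_logit(d, n_iter=30):
    """No-intercept logit on case-minus-control differences (outcome always 1)."""
    b = 0.0
    for _ in range(n_iter):
        grad = hess = 0.0
        for di in d:
            p = sigmoid(b * di)
            grad += di * (1.0 - p)
            hess += di * di * p * (1.0 - p)
        b += grad / hess
    return b


x_sample, y_sample, diffs = [], [], []
for _ in range(N_STRATA):
    stratum_mean = random.gauss(0.0, 1.0)        # exposure level differs by stratum
    stratum_intercept = random.gauss(-3.5, 0.5)  # baseline risk differs by stratum
    cases, noncases = [], []
    for _firm in range(FIRMS_PER_STRATUM):
        x = stratum_mean + random.gauss(0.0, 1.0)
        p = sigmoid(stratum_intercept + TRUE_BETA * x)
        (cases if random.random() < p else noncases).append(x)
    # Exact matching on stratum: one randomly chosen control per case.
    n_matched = min(len(cases), len(noncases))
    controls = random.sample(noncases, n_matched)
    for x_case, x_control in zip(cases[:n_matched], controls):
        x_sample += [x_case, x_control]
        y_sample += [1, 0]
        diffs.append(x_case - x_control)

_, unconditional_beta = fit_logit(x_sample, y_sample)  # ignores the matching
conditional_beta = fit_pair_difference_logit(diffs)    # conditions on each pair
print(f"true {TRUE_BETA:.2f}  unconditional {unconditional_beta:.2f}  conditional {conditional_beta:.2f}")
```

In this particular design the matching variable is associated with exposure, so the unconditional estimate is pulled toward zero, as in Schlesselman's statement. Other designs, for example imperfect matching or additional correlated covariates, are not covered by this sketch.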
While awareness within accounting research of problems relating to unconditional
analysis is unfocused, Error 1 has been commented upon in the literature on discrimination in
mortgage lending. Giles and Courchane (2000) note that inconsistency of estimates
appears to have been missed… Several authors recognize that stratifying [use of
matching in sample selection] will affect the estimation of the constant term, but then fail
to realize that the inclusion of a racial group dummy variable results in separate stratum
constraints. In particular, reference is made to the discussion in Maddala (1983, pp. 90-91) and
Maddala (1991, pp. 792-793), which relate to stratifying by outcome only; we
need to extend the results when we also stratify by a dummy variable covariate. (pp. 8-9).
Dietrich (2001), examining the impact of that misanalysis in mortgage lending, has
reported the striking conclusion that 6 of 23 studies, when replicated and reanalysed
appropriately, have their results reversed. We observe that the textbook examples of matched
sample studies differ in crucial ways from the non-random observational studies common in
accounting research in which matched sample t-tests are often applied. First, in accounting
research matching is often by closest size or other measure, and Error 2 is possible, that is, a
difference in outcome may be driven by residual difference in the matching variable. To address
this, as we explain in Section 5, an OLS regression generalization of the t-test is required.
Error 2: Failure to Control for Effect of Imperfectly Matched Variables
A second technical error (Error 2) in the analysis of matched or choice-based samples
occurs when the matchings are not exact. This error stems from imperfection in the matching
process, e.g. from selecting pair-wise controls that are “closest” rather than exact matches on a
continuous variable such as firm size. Where closest matching is used, the researcher should
consider explicitly the possibility that the remaining pair-wise differences in the matching
variable may themselves be sufficient to explain observed patterns of outcomes, and attempt to control
for this possibility. For example, the researcher can try including linear and/or quadratic terms
of the size variable in the model. Frequently accounting researchers have failed to control for this
problem in estimation when closest matching is used.⁹
8 (continued) is subject to Error 3 as discriminant analysis does not benefit from the logit exemption. In fact it may not be
possible to avoid Error 3 in discriminant analysis of CB-FM studies: reweighting cannot be applied to each
datum according to its stratum’s sampling rate, when data have been differenced. Tatsuoka (1971) provides
guidance on applying prior probability information in some statistical settings, but not specifically in the
matched pair example he gives (and Tatsuoka’s example itself suffers from Error 2 and Error 3).
The second technical problem (Error 2) has been noted at times in the accounting
literature. Some authors have included in their models the continuous variables such as size
which they used in matching, explicitly to control for the residual effect of pair-wise imperfect
matching in those variables (e.g. Lys and Watts (1994) and Carcello and Neal (2003)), while
others have not. Of 37 audit papers needing an evaluation of the impact of imperfect matching,
27 did not provide it, so discerning consumers of the research would be left with uncertainty as
to the likely impact of this omitted correlated variable upon reported results.
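A minimal sketch of how residual size differences can drive a matched-pair comparison, and how a linear control addresses it, is given below. The data are simulated under assumptions of our own choosing (cases are on average somewhat larger than their closest-size controls, size affects the outcome, and the true case-versus-control effect is zero); the variable and function names are illustrative only.

```python
import random

random.seed(2)
N_PAIRS = 1000          # assumed number of closest-size matched pairs
SIZE_EFFECT = 1.0       # assumed effect of (log) size on the outcome
TREATMENT_EFFECT = 0.0  # assumed: no true case-versus-control effect

d_size, d_outcome = [], []
for _ in range(N_PAIRS):
    # "Closest" matching leaves cases systematically larger than controls.
    gap = 0.5 + random.gauss(0.0, 0.5)
    d_size.append(gap)
    d_outcome.append(TREATMENT_EFFECT + SIZE_EFFECT * gap + random.gauss(0.0, 1.0))


def mean(values):
    return sum(values) / len(values)


def ols_intercept_and_slope(x, y):
    """Simple regression y = a + b*x; returns (a, b)."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    return my - b * mx, b


# Matched-pair mean difference, with the residual size gap ignored.
naive_effect = mean(d_outcome)
# Regression of the outcome difference on the size difference: the intercept
# estimates the case-versus-control effect net of the residual size gap.
adjusted_effect, size_slope = ols_intercept_and_slope(d_size, d_outcome)
print(f"naive {naive_effect:.2f}  adjusted {adjusted_effect:.2f}  size slope {size_slope:.2f}")
```

Here the naive paired difference is clearly non-zero even though no treatment effect was simulated, and the intercept of the differenced regression is close to zero. A quadratic term in the size difference could be added in the same way if the size effect is thought to be nonlinear.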
Error 3: Failure to Reweight Observations According to Differing Sampling Rates
A third technical error (Error 3) in the analysis of matched or choice-based samples arises
from the fact that these non-random samples’ numbers of observations in outcome groups or in
matched sets are not proportional to the size of their categories in the general population. The
choice-based and/or matched sample selection process creates stratified samples that are
deliberately not proportionally representative: the rare outcome is represented in the sample as
often as the common one; the large industry has no more representation than the small industry.¹⁰
Second, with the exception of NCB-FM-W designs, accounting studies also differ in that they
employ non-random selection, hence Error 3 or non-generalizability will apply, unless the logit
exemption is enjoyed or unless reweighting of each observation in the analysis is employed.
Except in NCB-FM-W settings, application of a matched pair t-test towards ascertaining a group
difference in an accounting ratio does not yield any generalizable result. Hence we find the
classical introductory statistics textbook discussion is adequate only for guiding univariate
analyses in the NCB-FM-W research design, and does not address the other five research design
situations.
9 Of the audit research sample, 37 papers used closest matching and, of those, 27 failed to include at least a linear
term for that matching variable in their analysis.
10 Five of six research design categories described later are non-random. The exception is certain within-subject
studies. For example if cases for a before-and-after study of audit fee levels are randomly selected from all
potential subjects having data availability at the before and after times, the results are fairly generalizable to the
population of continuing firms. (A distinct survivorship bias is potentially present; audit fee level changes
identified within continuing firms may not generalize to the whole population that includes entering and
departing firms.)
In accounting research, non-random samples are commonly analyzed as if they were
random. In marketing research, by contrast, where stratified sampling (one kind of non-random
sampling) is commonly employed, it is well understood that to preserve generalizability it is
necessary to reweight the data. A stratified sample taken to assess a univariate measure (e.g.
proportion of likely buyers of a product, or likely voters for a political candidate) is analyzed by
reweighting: down-weighting observations from the over-represented strata, upweighting the
under-represented. The usual method to make use of nonrandom samples in regression and other
analyses, also, would be for the researcher to reweight observations analogously, weighting each
observation inversely to the sampling rate applied in selecting from its stratum of the larger
population.¹¹ In accounting research, an exception to the general need for reweighting has been
noted and applied for some logit regression studies such as Palepu (1986). However, most
applications in accounting research fail to reweight and fail also to conform to the limited
requirements for the logit exemption to apply, and hence the generalizability of their results is in
question.¹²
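The univariate reweighting described above amounts to weighting each observation by the inverse of its stratum's sampling rate. A minimal sketch follows; the strata, population counts, and responses are hypothetical numbers chosen only to make the arithmetic easy to follow.

```python
# Hypothetical strata: population size, number sampled, sampled "yes" answers.
strata = {
    "large_stratum": {"population": 9000, "sampled": 100, "yes": 10},
    "small_stratum": {"population": 1000, "sampled": 100, "yes": 60},
}


def unweighted_proportion(strata):
    """Treat the stratified sample as if it were a simple random sample."""
    total_yes = sum(s["yes"] for s in strata.values())
    total_n = sum(s["sampled"] for s in strata.values())
    return total_yes / total_n


def reweighted_proportion(strata):
    """Weight each observation by 1 / (sampling rate of its stratum)."""
    weighted_yes = weighted_n = 0.0
    for s in strata.values():
        weight = s["population"] / s["sampled"]  # inverse sampling rate
        weighted_yes += weight * s["yes"]
        weighted_n += weight * s["sampled"]
    return weighted_yes / weighted_n


naive = unweighted_proportion(strata)
reweighted = reweighted_proportion(strata)
print(f"unweighted {naive:.2f}  reweighted {reweighted:.2f}")
```

With these hypothetical figures the unweighted sample proportion is 0.35, while the reweighted estimate of the population proportion is 0.15, because the small stratum is heavily over-represented in the sample.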
11 Examples of reweighted analyses in accounting research are Zmijewski (1984), Dopuch, Holthausen and
Leftwich (1987), and Koh (1991).
12 A limited version of the logit exemption was described and endorsed prominently by Maddala (1991). The
limited version, which involves applying a logit model as if the sample were randomly selected, applies only to
settings with choice-based sampling such as Palepu (1986)’s sample of reorganized firm cases compared to a control
sample selected randomly from non-reorganized firms, hence there is stratification by outcome alone. When
pair-matching or other further stratification within the control sample selection is utilized, the limited version
does not apply and adjustments that fully saturate the model are necessary for the logit exemption to apply (as
will be developed in Section III). Of the 73 audit research papers, 42 suffer Error 3 and are not logit regression.
There are also 22 logit regression papers that would need to utilize a fully-saturated model for the logit
exemption to apply, but do not, so Error 3 applies for these as well. Using abbreviations defined below, we
identify only two WESML studies, two CB-NM logit studies and five NCB-FM-W studies not suffering Error 3.
On Error 3, there is recognition that reweighting or the use of logit regression is
necessary in the analysis of choice-based sampled data. The logit exemption to reweighting has
been discussed in accounting research since Palepu (1986). Palepu noted that only the intercept
term in his logit analysis of a choice-based sample was biased, and even that could be corrected
by an adjustment using exogenous population frequency information, citing statisticians Manski
and Lerman (1977) and Manski and McFadden (1981). Zmijewski (1984), citing Palepu’s
working paper, reviewed 17 papers on bankruptcy and advocated the use of reweighting, in
particular the use of weighted exogenous sample maximum likelihood (WESML) to make
adjusted estimations.¹³ Greene (2004) summarizes the issue for choice-based sampling
succinctly. In what we infer were CB-NM studies of loan default, “the dependent variable
measured the occurrence of loan default, which is a relatively uncommon occurrence. To enrich
the sample, observations with y=1 (default) were oversampled. Intuition should suggest
(correctly) that the bias in the sample should be transmitted to the parameter estimates, which
will be estimated so as to mimic the sample, not the population, which is known to be different.”
For CB-SM or CB-FM studies, we note the oversampling would vary by strata within outcome
group. Greene then explains the WESML estimator that would address the CB-NM case
correctly, and does not consider the logit exemption.
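To make the reweighting idea concrete, the following Python sketch computes WESML-style weights as the ratio of each outcome group's population share to its sample share, and applies them to a one-regressor logit log-likelihood. The function names and the 2% default rate are illustrative assumptions of ours, not values taken from the studies cited.

```python
import math

def wesml_weights(population_share, sample_share):
    """Weight per outcome group: population share divided by sample share."""
    return {g: population_share[g] / sample_share[g] for g in population_share}

def weighted_logit_loglik(b0, b1, rows, weights):
    """rows: iterable of (x, y); each term is weighted by its outcome group's weight."""
    total = 0.0
    for x, y in rows:
        index = b0 + b1 * x
        p1 = 1.0 / (1.0 + math.exp(-index))  # Pr(y = 1 | x) under the logit model
        total += weights[y] * math.log(p1 if y == 1 else 1.0 - p1)
    return total

# Hypothetical setting: defaults are 2% of the population but 50% of the sample.
weights = wesml_weights({1: 0.02, 0: 0.98}, {1: 0.5, 0: 0.5})
print(weights)
```

Maximizing the weighted log-likelihood over (b0, b1) would give the reweighted estimates; with further stratification within outcome groups, one weight per stratum would be needed instead of one per outcome.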
Maddala (1991) reviewed Zmijewski’s and other tabulations of logit and probit
applications, and concluded that WESML is not needed in logit settings, and endorsed the use of
logit in Palepu (1986) and two other papers.14 In an earlier paper, Maddala (1983) had described
the problem of choice-based sampling as “a case of stratification by an endogenous variable”
and had gone on to state that “Manski and Lerman (1977) showed that treating choice-based
samples as if they were random and calculating estimators appropriate to random samples will
generally yield inconsistent estimates.” He notes that logit coefficients besides the intercept
would not be biased in analysis of 0-1 outcome choice-based samples.
13
Zmijewski’s preference for reweighting may have been influenced by his incorrect assertion that it is discriminant
analysis, not logit analysis, which provides estimates unbiased but for the intercept. Discriminant analysis has
strong distributional assumptions that are usually not justified in accounting research.
14
The two papers are Dopuch, Holthausen, and Leftwich (1987) and McNichols and Dravid (1990). We note these
authors had selected their control samples with semi-matching on year which was not accounted for in the
analysis. To obtain technically correct coefficients on the research variables of interest in these datasets, the
researcher applying a logit analysis would need to control for the levels of the matching variables by including a
dummy variable for each year. If WESML is used, different weightings would have to be applied to each year’s
strata. While the technical error may well not have had a significant impact in these studies (having only two
years of data, and those years perceived to be similar), the error may be very significant in studies involving data
over different time periods. Maddala endorsed these papers without calling attention to this problem.
Maddala’s suggestion has been widely cited by accounting researchers who thought that
as long as logit regression was used, coefficients other than the intercept would not be biased.
This is not always true. The logit exemption allows the use of unweighted logit regressions to
analyse choice-based matched sample data, delivering asymptotically unbiased coefficient
estimates and standard errors on non-intercept variables, providing that the model is fully
saturated.15 The typical logit application in accounting research, however, has estimated an
unweighted and unsaturated model, which does not control for the matching variables’ effects
and does not enjoy the logit exemption from the need to weight data to reflect population
proportions. We also observe that researchers have considered unweighted estimation of choice-based
samples as being acceptable outside of logit settings, too. Not adequately appreciated is the
logical corollary to Maddala’s statement: if logit regression is not employed, then analysis of a
non-random sample as if it were a random sample is not acceptable. By not accounting for the
matching in the analysis, the analysis suffers from the omission of correlated variables, leading
to bias in all coefficient estimates, including unpredictable biases on those of research interest.
Almost all of the published discriminant, logit, and probit analyses of matched samples in
accounting research have been misanalyzed along these lines.16 We believe that the persistence
of the incorrect practice is due to the guidance perceived to have been provided by Zmijewski
(1984), Palepu (1986), and Maddala (1991); they are cited frequently in our sample of audit
research studies. A typical quote is as follows: “Maddala (1991: 793) argues that if this choice-based
sample is used to estimate a logit model, no weighting procedure is needed. The
coefficients of the explanatory variables are not affected by the unequal sampling rates. It is only
the constant term that is affected.” This seems to be an accurate statement of Maddala’s 1991
position as Maddala did not preface it with a disclaimer that its applicability was restricted only
to non-matched samples. In fact, when matching is also present, constant terms for each matched
set must also be included, and each of those will be affected, but will permit accurate estimation
of the research variables of interest. While Maddala did not note these complications in his 1991
15
A model can be fully saturated by including a dummy variable for each stratum, e.g. including a pair-identification
dummy for each pairing in a fully matched sample. Alternatively, the model can be posed on pairwise
differenced data. The latter is preferred for logit analysis.
16
The two exceptions are the aforementioned Harrison (1977) and Burgstahler et al. (1989), who uniquely applied a
no-intercept probit model to pairwise differences in bankruptcy prediction, avoiding error 1. We have not found
any other exceptions in accounting research in print prior to 2004.
work, in a later review for a Handbook of Statistics article (Maddala, 1996), he broadly
dismissed the use of matched samples, stating that “a logit analysis based on ‘matched samples’
cannot tell us anything about the effects of measured characteristics on failure rates” (p. 560). It
is unclear to us whether Maddala (1996)’s strong dismissal referred only to the incorrect
unconditional analysis commonly applied, or whether he would also have had reservations about
conditional analysis, but at any rate only his endorsement of logit analysis has been cited widely
while his dismissal has gone unnoticed in accounting research.17
We rely upon a correct statistical theory for the analysis of choice-based and matched
samples that has developed largely in the biostatistics literature, where it has been shown that
logit models in fact can very usefully discern conditional (within cluster) effects from matched
samples. This literature supports our society’s massive investment in cancer and other medical
research. Breslow (1996) provides a historical review. Briefly, Anderson (1972) and Prentice and
Pyke (1979) made key contributions, establishing that appropriate logit analysis can yield
coefficient estimates (besides the intercept) and corresponding standard errors that are valid
when performed on fully-matched sample data. Breslow and Day (1980, 1987) developed and
popularized the matched sample methodology, leading to wide application in medical research.
Monographs by Schlesselman (1982) and Hosmer and Lemeshow (1988) serve practitioners.
There are few citations in accounting of this literature, besides several citations of Hosmer and
Lemeshow’s discussion of the logit model in general terms; their chapter describing the
appropriate analysis of case-control paired samples seems not to have been noticed or understood
to be applicable.
A complication in the appropriate logit analysis of choice-based pair-matched samples
has been noted, and has been developed by Abrevaya (1996). It turns out that appropriate logit
analysis of pair-wise differences yields the same relative estimates of coefficients as does logit
analysis of pooled, non-differenced data with pairings accounted for by inclusion of a dummy
variable for each pairing (less one, or without an overall intercept). However, the latter approach
yields coefficients that are exactly twice the magnitude, but standard errors are not scaled
proportionately, so different p-values are reported. As the scaling of logit coefficients is
arbitrary, either method is correct for obtaining coefficient estimates, but it is the differences
17
No paper in our extensive review of accounting research has cited the late Maddala's 1996 paper, and the Social
Science Citation Index shows no such citations.
analysis that has the correctly corresponding standard errors and p-values and that is correct for
use in inferences. For logit analysis, the approach of including pair-wise dummies overstates the
significance of its coefficients. The differences approach is easily implemented in SAS software
by application of its PROC LOGISTIC with use of the STRATA statement to identify pairings, or in
Stata software by application of its CLOGIT command with its GROUP option (for which STRATA is
accepted as a synonym).
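The factor of two can be seen in a short sketch. The notation here is ours and is only illustrative: pair k has a case with covariates x_{k1}, a control with covariates x_{k0}, and its own dummy intercept \alpha_k, and F is the logistic distribution function. Concentrating \alpha_k out of the pair's contribution to the dummy-variable log-likelihood gives:

```latex
% Pair k: case covariates x_{k1}, control covariates x_{k0}, F(t) = 1/(1+e^{-t}).
\begin{align*}
\ell_k(\alpha_k,\beta) &= \log F(\alpha_k + x_{k1}'\beta)
   + \log\bigl[1 - F(\alpha_k + x_{k0}'\beta)\bigr], \\
\frac{\partial \ell_k}{\partial \alpha_k} = 0
 &\;\Longrightarrow\; F(\alpha_k + x_{k1}'\beta) + F(\alpha_k + x_{k0}'\beta) = 1
 \;\Longrightarrow\; \hat{\alpha}_k(\beta) = -\tfrac{1}{2}\,(x_{k1} + x_{k0})'\beta, \\
\ell_k\bigl(\hat{\alpha}_k(\beta),\beta\bigr)
 &= 2\,\log F\bigl(\tfrac{1}{2}\,d_k'\beta\bigr), \qquad d_k = x_{k1} - x_{k0}.
\end{align*}
```

The conditional (differences) log-likelihood is \sum_k \log F(d_k'\beta), so the dummy-variable estimate maximizes the same function evaluated at \beta/2 and is therefore twice the conditional estimate. Under this sketch the curvature of the concentrated likelihood is half that of the conditional likelihood, which suggests standard errors larger by a factor of about \sqrt{2} rather than 2, so test statistics from the dummy-variable approach would be inflated by roughly \sqrt{2}.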
Modern econometrics textbooks have little mention of matched samples. Heckman,
Ichimura, and Todd (1998), however, show that the correct treatment is well understood among
current econometricians. They dislike matched sample studies, instead preferring to model
selection processes explicitly. This criticism seems particularly apt for non-choice-based studies
of management decisions such as Wallace (1997). Biostatisticians, on the other hand, would use
matched samples to quickly investigate possible relationships in a new area, and then apply
randomized experiments, longitudinal studies, and other approaches to deepen knowledge.
Econometricians’ preference to model the selection processes explicitly would require more data,
and would preclude, for example, the choice-based matched sample studies that biostatisticians
and accounting researchers sometimes employ involving analysis of costly hand-collected
variables for strategically chosen observations.
Bergstrahl, Kosanke, Jacobsen (1991) provide an efficient means to identify matches for
matched sample selection in SAS software, with both optimal and greedy algorithms.18 Parsons
(2002) provides another SAS software implementation of a greedy algorithm. Barber and Lyon
(1986, 1987) and Kothari, Leone, and Wasley (2005) discuss potential strategies in selecting
matched samples. They advocate application of what are termed propensity scoring methods;
Heckman, Ichimura, and Todd (1998), however, provide a critical dismissal of those approaches.
The focus of the present paper is upon the correct analysis of choice-based and matched samples,
once selected, rather than upon the match selection process.
In the next section we go on to demonstrate the potential problems of Errors 1 and 2 in an
extended example.
18
BKJ(1991)’s SAS macro offers options for weighting on multiple matching variables and for greedy matching
(for the first treatment observation, selecting the closest match available) versus more strategic algorithms. We
employed the macro in our simulations and recommend its use.
3. Simulation-Based Example and Statistical Theory
In this section we provide an extended example of Errors 1 and 2 using logit regression.
We use a simulation to generate settings where "true" parameter values are known, allowing us
to compare the performance of alternative methods of analysis used in practice. This enables us
to explore which methods of analysis lead to incorrect conclusions, and which methods are more
efficient in converging to correct conclusions. This exercise will demonstrate ways that past
research may be incorrect: a) true effects may be suppressed, b) non-existent effects may
erroneously be found to be significant, and c) effects can be mismeasured, even to the extent
that an apparent effect in one direction may be found when its true effect is in the opposite
direction.
Suppose that a discrete 0-1 outcome Y is hypothesized to be driven by two variables of
interest, X1 and X2, as well as by a 'nuisance' categorical variable Z. This would be an
appropriate model, for example, in research attempting to measure the effect of market value
(X1) and of discretionary accruals (X2) upon whether or not an audit-related litigation (Y) occurs,
where it is already known that industry membership (Z) has a significant influence upon the
outcome.
We simulate this situation by generating a population of X1, X2, and Z data and then
generating outcomes Y that are a function of those plus a random error term. Arbitrarily, we let
there be five "industry" groups (five categories of Z), that will have (X1, X2) values in clusters
centered at (1,1), (3,1), (5,1), (7,1) and (9,1). In order to approximate variables in accounting
research, we generate X1 and X2 to be normally distributed and correlated. Specifically, within
each group, we generate 5000 observations of (X1, X2), distributed bi-variate normal with
correlation of 0.4. These data will be the independent variables in each of the three simulations
that follow.
Simulation 1 Demonstration of an Error 1 Impact: (a) Failure to Find a True Effect
Consistent with the distributional assumptions of logit regression, we generate values Y
according to the following formula:
$$y_i = \begin{cases} 1 & \text{if } \alpha_j + \beta_1 x_{1i} + \beta_2 x_{2i} + \varepsilon_i > 0, \\ 0 & \text{otherwise,} \end{cases}$$
where $\alpha_j$ is an intercept specific to the jth of J=5 groups in the population, β1 and β2 are the
coefficients reflecting X1 and X2’s influence upon the outcome, and $\varepsilon_i$ is a logistic distributed
random error term having mean 0 and variance $\sigma^2 = 1$. In this first of three cases, we set β1 = 1
and β2 = 1. In order to generate a mix of 1 and 0 outcomes within each group, given the
distribution of X1 and X2 and the chosen values of β1 and β2, we set the $\alpha_j$ values at -2, -4, -6, -8, and
-10. This yields more outcome 1’s above and to the right of each group’s center, and more
outcome 0’s below and to the left. This simulation yielded 12,477 outcome 1’s and 12,523
outcome 0’s.
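As a minimal sketch of this data-generating process, the Python code below draws the five clusters and the outcomes. Several details are our assumptions rather than statements from the text: X1 and X2 are given unit variances, the error is drawn from the standard logistic distribution (scale 1), and the seed is arbitrary, so the counts of 1's and 0's will not reproduce the figures reported above exactly.

```python
import math
import random

def simulate_population(beta1=1.0, beta2=1.0, n_per_group=5000, rho=0.4, seed=0):
    """Return a list of (group, x1, x2, y) tuples for five 'industry' groups."""
    rng = random.Random(seed)
    centers_x1 = [1.0, 3.0, 5.0, 7.0, 9.0]     # cluster centers are (center, 1)
    alphas = [-2.0, -4.0, -6.0, -8.0, -10.0]   # group-specific intercepts
    population = []
    for group, (center, alpha) in enumerate(zip(centers_x1, alphas)):
        for _ in range(n_per_group):
            z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
            x1 = center + z1
            # Mixing z1 into x2 gives correlation rho between x1 and x2.
            x2 = 1.0 + rho * z1 + math.sqrt(1.0 - rho ** 2) * z2
            u = rng.random() or 0.5            # guard against a draw of exactly 0
            eps = math.log(u / (1.0 - u))      # standard logistic draw via inverse CDF
            y = 1 if alpha + beta1 * x1 + beta2 * x2 + eps > 0 else 0
            population.append((group, x1, x2, y))
    return population

population = simulate_population()
n_ones = sum(row[3] for row in population)
print(len(population), n_ones)
```

Because each group's intercept offsets its cluster center, the index is zero at every center and roughly half of the outcomes in each group are 1's.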
Assuming there was a significant cost to data collection, a researcher might reasonably
choose to study the relationship of X1 and X2 to Y by gathering a limited choice-based matched
sample. We simulate this by selecting, at random, 50 out of the 12,477 observations having
outcome 1. We then create a comparison sample by randomly selecting, for each one of
those, a matching observation from those having outcome 0 and appearing in the same
“industry” group. This yields 100 observations in 50 nominal pairs that may be analysed in
various ways. A scatter plot of the data generated for 50 pairs is included in Table 1. We create
increasingly larger samples of 100, 200, 400, 800, and 1600 pairs by adding observations
selected in the same way. Results of analysis for each sample size are tabulated in Table 1.
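The selection step just described can be sketched in Python as follows. The function name, the tuple layout (group, x1, x2, y), and the toy population are hypothetical; we also assume that controls are drawn with replacement from the pool of outcome-0 observations in the case's group, which the text does not specify.

```python
import random

def draw_matched_pairs(population, n_pairs, seed=0):
    """population: list of (group, x1, x2, y). Returns a list of (case, control)."""
    rng = random.Random(seed)
    cases = [row for row in population if row[3] == 1]
    controls_by_group = {}
    for row in population:
        if row[3] == 0:
            controls_by_group.setdefault(row[0], []).append(row)
    pairs = []
    for case in rng.sample(cases, n_pairs):          # choice-based: start from outcome 1
        control = rng.choice(controls_by_group[case[0]])  # same "industry" group
        pairs.append((case, control))
    return pairs

# Toy population (hypothetical): 5 groups of 200 rows with outcomes assigned at random.
toy_rng = random.Random(42)
toy_population = [
    (g, toy_rng.gauss(2 * g + 1, 1), toy_rng.gauss(1, 1), toy_rng.randint(0, 1))
    for g in range(5) for _ in range(200)
]
pairs = draw_matched_pairs(toy_population, n_pairs=50)
print(len(pairs))
```

Larger samples would be obtained by calling the function with a larger `n_pairs`, or by appending further pairs drawn in the same way.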
First consider analysis as has been most commonly done in accounting research, i.e.
running the logit regression:
$$y_i = \begin{cases} 1 & \text{if } \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \varepsilon_i > 0, \\ 0 & \text{otherwise,} \end{cases} \tag{1}$$
where β0 is a single overall intercept that is estimated and $\varepsilon_i$ is an error distributed according to
the logistic distribution. This is an unconditional, pooled analysis that we term “unmatched”.
Note that it uses neither group identifier information nor pairing information. Logistic regression
software finds the maximum likelihood estimates $\hat{\beta}_0$, $\hat{\beta}_1$, and $\hat{\beta}_2$ by numerical search that
maximizes the product of probabilities for occurrence of the observed data, where for outcomes
$y_i = 1$ the probability is:
$$\Pr(y_i = 1 \mid x_{1i}, x_{2i}) = F(\beta_0 + x_{1i}\beta_1 + x_{2i}\beta_2) = \frac{\exp(\beta_0 + x_{1i}\beta_1 + x_{2i}\beta_2)}{1 + \exp(\beta_0 + x_{1i}\beta_1 + x_{2i}\beta_2)} \tag{2}$$
and for outcomes $y_i = 0$ the probability is:
$$\Pr(y_i = 0 \mid x_{1i}, x_{2i}) = 1 - F(\beta_0 + x_{1i}\beta_1 + x_{2i}\beta_2) = \frac{1}{1 + \exp(\beta_0 + x_{1i}\beta_1 + x_{2i}\beta_2)} \tag{3}$$
where F is the cumulative distribution function of the logistic distribution.
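For readers who want to see the objective that the numerical search maximizes, a minimal Python sketch of the pooled ("unmatched") log-likelihood follows. The function names are ours; the computation is simply the sum of the logs of the two probabilities given above, written in a numerically stable form.

```python
import math

def log_logistic_cdf(t):
    """Numerically stable log F(t) for the logistic CDF F(t) = 1 / (1 + exp(-t))."""
    if t >= 0:
        return -math.log1p(math.exp(-t))
    return t - math.log1p(math.exp(t))

def pooled_loglik(b0, b1, b2, rows):
    """rows: iterable of (x1, x2, y). Sum of log Pr(y_i | x_1i, x_2i)."""
    total = 0.0
    for x1, x2, y in rows:
        index = b0 + b1 * x1 + b2 * x2
        # For y = 0 the probability is 1 - F(index), which equals F(-index).
        total += log_logistic_cdf(index) if y == 1 else log_logistic_cdf(-index)
    return total

# With all coefficients at zero, every observation has probability one half.
print(pooled_loglik(0.0, 0.0, 0.0, [(1.0, 2.0, 1), (3.0, 4.0, 0)]))
```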
In the first two columns of Table 1 we report selected results of this unmatched analysis
applied for 100 observations in 50 pairs: estimated coefficients $\hat{\beta}_1$ = -.083 and $\hat{\beta}_2$ = 1.56, with
corresponding p-values of .32 and <.0001. We do not report $\hat{\beta}_0$. A researcher could infer that X2
affects Y while X1 does not, when in fact the true β1 = β2, so X1 and X2 affect Y equally. The next
question we consider is whether increasing the sample size would allow estimation to identify
what we know, by construction, to be true: that the two variables have an equal and positive
effect on Y. Continuing down the column within Simulation 1, we observe that increasing the
sample size up to 1600 pairs does not accomplish that. Unmatched analysis applied to all 25,000
observations in the simulation eventually yields a statistically significant coefficient $\hat{\beta}_1$ = .086, but
its estimated magnitude is a small fraction, about seven percent, of the estimated coefficient $\hat{\beta}_2$.
Unmatched analysis in this case is inconsistent: it will not converge to the true values even
asymptotically.
Now consider an analysis that takes into account the pair-matching. This is implemented
by running a conditional logit regression. This is essentially a no-intercept logit regression of
pair-wise differences in Y upon pair-wise differences of the independent variables.19
19
As developed in Appendix 1, the estimation in effect maximizes the product of observed probabilities that the
pair-wise difference is 1, as a function of pairwise differences in X1, X2, and the vector of industry dummies Z.
To implement this, a numerical search is run to maximize the product of probabilities:
$$\Pr\bigl((y_i - y_j) = 1 \mid x_{i1}, x_{i2}, x_{j1}, x_{j2}, z_i, z_j\bigr) = \frac{\exp\bigl((\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \gamma z_i) - (\beta_0 + \beta_1 x_{j1} + \beta_2 x_{j2} + \gamma z_j)\bigr)}{1 + \exp\bigl((\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \gamma z_i) - (\beta_0 + \beta_1 x_{j1} + \beta_2 x_{j2} + \gamma z_j)\bigr)}$$
Disconcertingly for some, the pairwise difference in Y that is the dependent variable is uniformly
one for each observation, giving rise to a seeming paradox in estimation. It may seem
impossible to estimate such an expression, i.e. regressing a vector of 1’s on a vector of data times
coefficients to be estimated.20 If this were a regular logit regression, having all outcomes 1’s
would mean that it could not be estimated: it would be a situation of “complete separation” and
the maximum likelihood estimation procedure would find that increasing coefficient estimates
continually towards infinity would indefinitely continue to increase the likelihood. Resolution of
the paradox is found by noting that this is not a regular logit regression, but instead this is a no-intercept
regression. Here, one is finding the unique coefficients maximizing a likelihood
expression subject to a very strong constraint that the intercept value is fixed. In an analogous
no-intercept OLS regression, the intercept would be zero, but here, when pairwise difference in
each independent variable is zero, the conditional probability value that it is Yi = 1 and Yj = 0
rather than the reverse, is in fact ½. Maximizing the appropriate likelihood expression enforces
that; this is easily implemented in standard statistical software.
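The estimation just described can be sketched in a few lines of Python: a no-intercept logit on case-minus-control differences, fitted by Newton-Raphson. This is an illustration under our own assumptions, not the SAS or Stata routine: the helper names are hypothetical, X1 and X2 have unit variances, errors are standard logistic, and pairs are drawn by a shortcut (pick a group at random, then draw observations until one case and one control are found) that stands in for sampling from a stored population.

```python
import math
import random

def sigmoid(t):
    """Logistic CDF, written to avoid overflow for large |t|."""
    return 1.0 / (1.0 + math.exp(-t)) if t >= 0 else math.exp(t) / (1.0 + math.exp(t))

def simulate_pair_differences(n_pairs, beta1=1.0, beta2=1.0, rho=0.4, seed=1):
    """Case-minus-control differences (d1, d2) for pairs matched within a group."""
    rng = random.Random(seed)

    def draw(center, alpha):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x1 = center + z1
        x2 = 1.0 + rho * z1 + math.sqrt(1 - rho ** 2) * z2
        u = rng.random() or 0.5                 # guard against a draw of exactly 0
        eps = math.log(u / (1 - u))             # standard logistic error
        return x1, x2, int(alpha + beta1 * x1 + beta2 * x2 + eps > 0)

    diffs = []
    while len(diffs) < n_pairs:
        j = rng.randrange(5)
        center, alpha = 1.0 + 2.0 * j, -2.0 * (j + 1)
        case = control = None
        while case is None or control is None:  # keep first case and first control seen
            x1, x2, y = draw(center, alpha)
            if y == 1 and case is None:
                case = (x1, x2)
            elif y == 0 and control is None:
                control = (x1, x2)
        diffs.append((case[0] - control[0], case[1] - control[1]))
    return diffs

def fit_conditional_logit(diffs, iterations=25):
    """Maximize sum of log F(d1*b1 + d2*b2) by Newton-Raphson; no intercept."""
    b1 = b2 = 0.0
    for _ in range(iterations):
        g1 = g2 = h11 = h12 = h22 = 0.0
        for d1, d2 in diffs:
            p = sigmoid(b1 * d1 + b2 * d2)      # Pr(observed ordering | one 1 and one 0)
            w = p * (1 - p)
            g1 += (1 - p) * d1
            g2 += (1 - p) * d2
            h11 += w * d1 * d1
            h12 += w * d1 * d2
            h22 += w * d2 * d2
        det = h11 * h22 - h12 * h12
        b1 += (h22 * g1 - h12 * g2) / det       # Newton step using the 2x2 inverse
        b2 += (h11 * g2 - h12 * g1) / det
    return b1, b2

diffs = simulate_pair_differences(2000)
b1_hat, b2_hat = fit_conditional_logit(diffs)
print(round(b1_hat, 3), round(b2_hat, 3))
print(sigmoid(0.0))  # zero differences give probability one half
```

Although every "outcome" is a 1, the estimates stay finite because the absence of an intercept pins the probability at one half when the differences are zero; only the direction and size of the differences carry information.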
The next two columns give conditional logit coefficient estimates based on the same 50
pairs of data: estimated coefficients $\hat{\beta}_1$ = .89 and $\hat{\beta}_2$ = 1.89, both different from zero at
conventional significance levels. Estimation on increasing amounts of data up to 1600 pairs
yields estimated coefficients $\hat{\beta}_1$ = .989 and $\hat{\beta}_2$ = .966, which are close to their true values. The
simulation suggests that conditional logit estimation is consistent, i.e. that it converges upon the
true coefficient values. We provide a proof that the conditional logit estimation is in fact
consistent in Appendix A.
Note, within this simulation, from 50 pairs on, a true effect that X1 contributes positively
to Y is revealed in the conditional logit analysis, but is concealed in the unmatched analysis until
$$= \frac{\exp\bigl((x_{i1} - x_{j1})\beta_1 + (x_{i2} - x_{j2})\beta_2\bigr)}{1 + \exp\bigl((x_{i1} - x_{j1})\beta_1 + (x_{i2} - x_{j2})\beta_2\bigr)}$$
Note that in taking the pairwise difference, the overall intercept drops out, as does each pairwise difference in
industry, Z. Thus the vector of industry effects is not estimated. The expression can be interpreted as the
conditional probability that it is Yi = 1 and Yj = 0, rather than the other way around, given that one is 1 and the
other is 0. For convenience in estimation, however, we reorder as necessary so Yi – Yj = 1 always. Note when
X1i – X1j = 0 and X2i – X2j = 0, the expression simplifies to ½, the intercept probability value.
20
The paradox has puzzled accounting researchers and, in general, led some to avoid matched sample designs and
led others to perform pooled, unmatched analyses of matched sample data that are not justified.
400 pairs are collected and analyzed. This demonstrates a Type II error stemming from Error 1: a
significant effect is not identified when in fact it is true. The industry groupings from columns 5
and 6 will be discussed below.
Simulation 2 Demonstration of an Error 1 Impact: (b) Finding a False Effect
If only Type II errors were caused by misanalysis, perhaps one could still trust
unmatched analysis that achieved significant results, as if the results were shown despite a bias
against finding them. But if a variable makes no real contribution, can a Type I error
also occur?
To examine this possibility, we keep the same distribution of X1, X2, and Z, and
regenerate Y as a slightly different function, now setting β1 = 1 and β2 = 0. In this second
simulation, X2 has no contribution to Y, and one would hope that analysis will not erroneously
identify a significant coefficient $\hat{\beta}_2$. In the Simulation 2 panel of Table 1, we present results of
analyzing successively larger samples. In the first columns of this panel, the coefficient
estimates go to .087 and to .303 in the unmatched analysis. In the conditional logit, estimates of
.988 and .079 are obtained that are closer to the true values of 1 and 0. In fact these conditional logit
results are not significantly different from the corresponding true parameter values of 1 and 0,
and at sample sizes below 1600 pairs the conditional logit analysis correctly identifies no significant effect of X2. In
unmatched analysis, however, a Type I error occurs: a highly significant positive influence of X2
is erroneously assessed.
Simulation 3 Demonstration of an Error 1 Impact: (c) A Sign Reversal
With unmatched analysis there will be some degree of misestimation of coefficients due
to the omitted correlated variables issue in any setting where there are correlations among the
independent, outcome, and matching variables, as demonstrated in Simulations 1 and 2.
Settings in which misanalysis produces the even more disturbing result of an estimated effect
opposite to the true effect are not difficult to find. We obtain such a situation by continuing to rotate the
relative effects of X1 and X2, within each industry, on the outcome variable. As before, we keep
the same distribution of X1, X2, and Z, but now regenerate Y setting β1 = -4 and β2 = 1. We find
a sign reversal effect of misanalysis: unmatched analysis yields a mistaken result that β2 is
significantly negative when in fact it is positive.
Results are reported in the Simulation 3 panel of Table 1. In the 200 pair case, the
conditional logit analysis performs well, identifying estimates of -4.49 and .810, which are
statistically different from zero and not far from the true values. With larger sample
sizes, the estimates improve, as before. But, for the unmatched analysis, the initial estimates
are $\hat{\beta}_1$ = -.164 and $\hat{\beta}_2$ = -.207, not close to the true values of -4 and 1. Going to larger sample sizes,
the unmatched analysis continues to lead to the erroneous conclusion: at 1600 pairs the
unmatched analysis estimates $\hat{\beta}_2$ = -.18. The unmatched analysis has identified a negative effect
for X2, when in truth X2 has a positive effect that is shown in the conditional logit analysis.
Simulation 4 Demonstration of Error 2: Failure to Control for Imperfect Matching
Up to this point we have considered simulations where the matching was exact: each
outcome 1 case was matched to an outcome 0 control having exactly the same industry. To
illustrate the potential for Error 2, failure to control for residual effects of imperfectly matched
variables, we revise the simulation process slightly. Suppose now that the researcher is only
interested in the effect of X2 on outcome, and chooses to match by industry group and by closest
X1. One could interpret X1 as a firm-size measure, perhaps the log of total assets. Using the same
distributions of variables and the same true equal and positive effects relating X1 and X2 to
outcome as in simulation 1, we perform the matching by industry and now also by closest X1.
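Matching by industry and closest X1 can be illustrated with a simple greedy routine in Python. This is a hypothetical stand-in for the SAS macro referred to elsewhere in the paper, not a port of it: each case, taken in order, receives the still-available control in its group with the nearest X1, and each control is used at most once.

```python
def greedy_match(cases, controls):
    """cases, controls: lists of (group, x1). Returns a list of (case, control)."""
    available = list(controls)
    pairs = []
    for case in cases:
        candidates = [c for c in available if c[0] == case[0]]  # same industry group
        if not candidates:
            continue                      # no control left in this group: case unmatched
        best = min(candidates, key=lambda c: abs(c[1] - case[1]))
        available.remove(best)            # match without replacement
        pairs.append((case, best))
    return pairs

# Toy data (hypothetical values).
cases = [(0, 1.0), (0, 2.0), (1, 5.0)]
controls = [(0, 1.9), (0, 1.2), (1, 4.0), (1, 5.5)]
matched = greedy_match(cases, controls)
residuals = [case[1] - control[1] for case, control in matched]
print(matched)
print(residuals)
```

The `residuals` list holds the case-minus-control differences in X1 that remain after matching; these are the residual differences whose effect the analysis may still need to control for.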
We must consider the results of analysis with and without including control for residual
differences in X1. In a modified simulation run with 1600 pairs, with X2 alone in the model, the
conditional logit estimate $\hat{\beta}_2$ is 1.4229 (standard error .0913, p-value <.0001).21 With both in the
model, however, we obtain an estimate for $\hat{\beta}_2$ of 1.1041 (standard error .1018, p-value <.0001).
Estimation is by conditional logit with stratification on pair identifiers. Including X1 as well as
X2 in the model “soaks up” the effect of residual differences in X1 upon pairwise differences in
outcome, and obtains an estimate close to the true value of β2 = 1. In this simulation setting, we
know that the form of X1’s effect on outcome is that it enters linearly, and hence including it in
the model controls for residual effects correctly. In other settings one might include a quadratic
21
For brevity, we do not tabulate results for different numbers of pairs.
term as well.
The point of this example is that the residual difference from imperfect matching can
drive results in analysis omitting control for that residual difference. Interestingly, in this
situation the estimate for β1 at 22.3933 (standard error 3.4813, p-value <.0001) is not very close
to 1, the true effect of X1. That perhaps occurs because the simulation, as implemented, obtains a
domain of pairwise differences in X1 that is vanishingly small. A precept to observe is that one
cannot accurately measure the effect of a matching variable upon outcome. When the effect of a
variable is of interest, it should not itself be used as a matching variable. To reiterate, the first
analysis, omitting the residual effect of X1, suffers Error 2 (omission of control for residual
differences), yielding an estimate more than 4 standard errors away from the true value β2 = 1; the second
analysis, controlling for it, is “spot on” the true value.
Potential for Simulation To Explore Error 3
The simulations presented above do not illustrate Error 3, non-generalizability of results.
In these simulations the error distributions conform to logistic distribution, and the logit
exemption to a need for reweighting applies. Error 3 can occur in univariate analysis or OLS or
probit regression analysis, where the data must be reweighted according to sampling rates in
each stratum or else the analysis is not generalizable to any universal population. The impact of
Error 3 could be explored by simulations in such non-logit settings, by comparing unweighted
vs. appropriately reweighted analyses.
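As a small illustration of what such reweighting involves in the simplest univariate case, the Python sketch below compares an unweighted sample mean with one weighted by the inverse of each stratum's sampling rate. The stratum labels, sampling rates, and data values are hypothetical.

```python
def weighted_mean(values, strata, sampling_rate):
    """Mean of values, weighting each observation by 1 / (its stratum's sampling rate)."""
    weights = [1.0 / sampling_rate[s] for s in strata]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

# Hypothetical sample: cases were sampled at a 50% rate, controls at a 1% rate.
values = [10.0, 12.0, 2.0, 4.0]
strata = ["case", "case", "control", "control"]
rates = {"case": 0.5, "control": 0.01}

unweighted = sum(values) / len(values)
reweighted = weighted_mean(values, strata, rates)
print(unweighted, reweighted)
```

The unweighted mean treats the over-sampled cases as if they were as common in the population as in the sample, while the reweighted mean restores each stratum to its population proportion.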
Semi-Matched Analysis Versus Fully Matched Analysis
A further potential improvement in the efficiency of analysis in Simulations 1, 2 and 3
can sometimes be obtained by recognizing the similarity of pairings within each Industry group.
In the samples selected, we know that there are sets of multiple pairings within the same
industry, hence sharing the same true value of industry intercept $\alpha_j$. We have not yet exploited
this additional information. To do so, we run conditional logit analysis stratified upon the 5
industry groups, instead of stratifying on the 50 or more pairings. The results are reported in
columns 5 and 6 of Table 1. The tabulation shows that this yields very similar results.
We have noted in other simulations, however, that use of the industry groupings can be
more efficient. For the Simulation 1 setting but with correlation of .9 rather than .4, for example,
we find that this yields statistically significant results sooner. With just 50 pairs, the estimate for
β2 is significant at the .05 level, while equivalently significant results for β2 are not found until
400 pairs, for the conditional logit using pairings. Both approaches appear to yield the correct
estimates, as do results using industry groupings in Simulations 2 and 3. The Simulation 1, 2, and
3 settings were in effect CB-SM not CB-FM settings. We provide a proof, in Appendix B, that
conditional logit provides asymptotically correct estimates for the effects of interest, as long as a
fully saturated model (including one intercept for each group) is used. Including multiple
intercepts for each separate pairing within each industry is duplicative and reduces the degrees of
freedom in the analysis.
The simulations have demonstrated that Error 1 and Error 2 problems can be very severe.
In the next section we provide further evidence from replications.
4. Replication and Reanalysis of Two Choice-Based Matched Sample Studies
Errors in analysis can also be illustrated by replications. We replicate two studies here.
Replication of a matched sample study tangibly demonstrates that when data is analyzed taking
into account the matching, estimates are different than when unmatched analysis is used. Our
first replication, of a medical example, illustrates that the difference uncovered may not be very
great in magnitude, but the context can be such that even a small difference is very important.
Second, our replication of Ghicas (1990) suggests that both Type I and Type II errors were made
in an accounting research analysis, and resolves an anomaly in the results that Ghicas had noted.
The Mack et al. (1976) retrospective study of cancer in a retirement community is a
prominent study published in The New England Journal of Medicine. It was employed as a
running example in Breslow and Day's (1980) monograph on the use of matched samples. It is
employed as an example of conditional logit analysis within SAS software documentation. We
chose it for convenience and also because the scientific community has weighed in on which is
the correct analysis. The goal of this case-control analysis is to determine the relative risk of
cancer that having a gall bladder disease condition contributes, while controlling for the effect of
hypertension. The researchers chose matching within one year of age, same marital status (ever-married
or single), and living in the community at the time of diagnosis of the patient’s disease.
It is a retrospective study and the cost of data collection was high: they interviewed subjects,
collected medical histories, clinical records, and prescription history.
Using the subsample of the study’s data that is included in the SAS documentation of
PROC LOGISTIC, Example 42.10, we replicate and then vary the analysis. Table 2 presents the
SAS documentation reported results, and our results applying unmatched and matched analysis.
We obtained conditional logit analysis results identical to those in SAS documentation. In this
example the coefficient estimates vary only slightly across specifications, and there is no Type I
or Type II crossing of significance levels as to the importance of any variable. But this medical
example is one in which it is especially easy to understand that even small differences do matter.
Here, the statistical significance of the gall bladder condition factor as a variable does not
change very much, going from a p-value of .0675 for unmatched to .0770 for conditional logit.
The Hypertension variable remains statistically insignificant (although
including the variable does provide value in the interpretation). The gall bladder variable's
coefficient estimate changes from .8417 to .9704, an increase of only 10%. The odds risk ratio, a
nonlinear function of the coefficient estimates, increases by about 20%, from 2.258 to 2.639. The
received interpretation of this study is that the odds risk ratio is 2.639 for the gall bladder factor,
meaning that cancer occurs 2.639 times as often among persons having the gall bladder condition
as among those who do not, in the population from which the study sample is drawn. The
unmatched estimate of 2.258 could be viewed as similar, perhaps, but the difference in the
estimate could affect serious decisions. In medical situations like this one, there are available
courses of treatment that would mitigate the risk of cancer, including options that are costly,
invasive, and accompanied by painful and/or uncertain side-effects. Doctors use odds ratio numbers to
consider offering treatment or costly additional testing options, or not, and patients, informed by
the doctors of the costs and the risks, must make serious decisions. If one is informed of a cancer
risk that is .4 times higher, one might make different choices about whether or not to pursue
options that could mitigate the risk.
Second, we replicate "Determinants of Actuarial Cost Method Changes for Pension
Accounting and Funding", the Accounting Review paper based on Dimitrios Ghicas’ University
of Florida Ph.D. dissertation. Ghicas (1990) is one of few papers in accounting research that both
employs matching in its sample selection and provides an explicit listing of its firm-year
observations with identification of its pairings, facilitating a replication.22 Ghicas (1990) applied
logistic regression to explain firms’ choices to switch actuarial methods for pension funding. The
choice to switch actuarial methods was explained by Ghicas as a function of factors that allowed
firms to subsequently report higher earnings. He was surprised to find no significance of size,
measured by assets, a likely proxy for the political exposure that might prevent more prominent
firms from raiding their pension funds. Because he pooled the case and control data from the 45
matched pairs, and omitted dummy variables for the pairings, his analysis suffers from errors 1
and 2.
Table 3 Panel A presents Ghicas’ reported results. Table 3 Panel B reports our unmatched
logit regression of the salient model that substantially matched Ghicas’ results. We then
reanalyzed the data with a pair-matched logit regression on the 86 pooled treatment and control
observations, by using conditional logit analysis. We compare the unmatched results to our conditional logit
results, which appear in Table 3, Panel C.
We found important differences. While there were no changes in sign between our
unmatched versus pair-matched analyses on the ten variables, the estimated statistical
significances of six of the ten variables shifted across traditional thresholds for strong and for
marginal significance (p-values of .01 and .10).23 The significance of two variables, IR and
LogTA, increased dramatically (p-values changing from .02 to .001, and from .08 to .02). Three
variables that were marginally significant in the unmatched analysis, WC, RUNI, and INT, were
now found not to be significant. One variable, CI, that was insignificant, became marginally
significant. The other four variables did not have significance level changes across the .01 or .10
significance level thresholds.
22 After contacting Ghicas and finding that his original data is no longer available, we substantially reconstructed
Ghicas’ dataset from Compustat, 10-K statements, and annual reports, without certitude of exactly replicating his
data because of ambiguities about fiscal year-ends and unavailable reports. Data for two pairs of observations
were especially problematic to reconstruct and thus were omitted from our analyses.
23 We would prefer to perform statistical tests for whether each coefficient is the same before and after,
individually, so as to say whether the change for each was statistically significant. This could be performed
easily in an ordinary regression context. But in the logit regression context, where the relative sizes are
determined but overall scaling is arbitrary, we do not see how to construct those tests. (Note, in a stacked
regression model, an included observation will have no influence upon a given model’s estimation if it has value
zero for each variable in that model. This is not true in a logit regression context.)
While Ghicas finds significant support for six of nine hypotheses (two with statistically
significant support and four with marginal support), we find significant support for only two of
Ghicas’ six hypotheses, and no support for the other hypotheses. Interestingly, the size measure,
logTA, now comes in strongly with .01 significance, while Ghicas’ discussion in hypothesis
development and footnotes suggested puzzlement on his part that this measure, a likely proxy for
political visibility, entered only marginally (at .11 in his and .08 in our unmatched analyses).
Also, we find marginal support in the “wrong” direction for one hypothesis for which he
expected to find support, but did not.
The differences in our analysis can largely be explained by the fact that the matching
controlled appropriately for factors related to industry and stock exchange listing. At least four of
the included variables are related to industry effects: leverage, working capital, rate-of-undertaking-new-investments, and size. Hence, these variables are correlated with the industry
dummy variables omitted from the unmatched analyses. And, the omitted dummy variables
would be expected to affect the probability of pension method switching. Therefore the
unmatched analysis performed originally reflects an omitted correlated variables problem and its
estimated coefficients are biased and unreliable.
Ghicas believed that his analysis did control for his matching variables, and, citing
Palepu (1986) and Zmijewski (1984), he stated that “The primary advantage of logit models is
the presence of consistent coefficient estimates whenever choice-based sampling is involved”
(Ghicas 1990, p. 385). This statement, although consistent with the general state of knowledge
within accounting research, failed to recognize the complication due to his further stratification
(of matched pairs) within 0-1 outcome sub-samples.
These replications show that reanalysis provides new insights in old data, and suggest
that previous choice-based and matched sample studies might best be reanalysed before being
relied upon further. In the next section we review other accounting studies and provide guidance
for future research.
5. Guidance for Statistical Analysis of Choice-Based and Matched Samples
In this section, we seek to provide the guidance on choice-based and matched samples
needed by researchers going forward, informed by past usage and the potential for errors that we
have demonstrated. We discuss, first, how univariate analysis can suffer each of the three errors
that we identify and what the corresponding remedies are. We discuss the general solution that
reweighting analysis provides, to compensate for non-random sample selection, when exogenous
sampling rate data is available. Then we provide more specific guidance in each of the six
distinct research designs that we have identified as important. Finally we comment on
approaches for performing matching and on considerations in choosing between possible
research designs, in advance of data collection.
Guidance for univariate case
Univariate analysis has routinely been applied in choice-based and matched samples in
accounting research. Sometimes this has been the main analysis, but more often recently this is
preliminary to multiple variable analysis that will follow. Often this is done to compare two
samples on each variable, e.g. in a choice-based sample, to compare the treatment sample of
bankrupt firms and the control sample of non-bankrupt firms.
As noted in Section II, many researchers studied the use of matched sample t-tests in
introductory statistics textbooks such as Johnson and Bhattacharyya (1985) or Rice (1995).
These textbooks establish that a matched sample t-test (a one sample test) is more powerful than
an unmatched (two sample) t-test in detecting a mean difference in a given measure, provided
there is a natural pairing in the data. What is a natural pairing? The presented examples include
pairings that are within-subject, as in Rice’s example comparing a blood platelets aggregation
index measured in blood samples taken from each subject before and after smoking a cigarette.
Or the pairing is prospective and treatment is randomized. “In a medical experiment, for
example, subjects might be matched by age or weight or severity of condition, and then one
member of each pair randomly assigned to the treatment group and the other to the control
group” (Rice, 1995, p. 410). Johnson and Bhattacharyya enjoin: “After pairing, the assignment
of treatments should be randomized for each pair” (p. 347). In these settings, the pairing is
shown to be useful if there is any positive pair-wise covariance in the univariate measure to be
compared, as that covariance will be subtracted from the sum of the variances of the measure
calculated within each of the two samples, yielding a lower unexplained variance, and hence a
higher ratio of mean difference divided by estimated standard error (t-value) can be discerned.
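The variance argument in this paragraph can be written out explicitly. As a sketch, assume n pairs (X_i, Y_i) with variances \sigma_X^2 and \sigma_Y^2 and within-pair covariance \sigma_{XY}; these symbols are our own notation for illustration, not that of the cited textbooks.

```latex
\begin{align*}
\operatorname{Var}(\bar{X}-\bar{Y}) &= \frac{\sigma_X^2+\sigma_Y^2-2\sigma_{XY}}{n}
  && \text{(paired observations)}\\
\operatorname{Var}(\bar{X}-\bar{Y}) &= \frac{\sigma_X^2+\sigma_Y^2}{n}
  && \text{(two independent samples of size } n\text{)}\\
t_{\text{matched}} &= \frac{\bar{D}}{s_D/\sqrt{n}}, \qquad D_i = X_i - Y_i
\end{align*}
```

When \sigma_{XY} > 0 the paired variance is the smaller of the two, so the same mean difference yields a larger t-value.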
In all six research design cases, we deem an Error 1 to have occurred when a univariate
analysis fails to account for the matching, as when an unmatched t-test is applied rather than the
matched t-test that is justified. In univariate t-test analysis, the inappropriate test choice cannot
cause an actual reversal of results: the numerator in the t-test for a matched sample is the average
of pair-wise differences, which is mathematically the same as the numerator for an unmatched
sample, the difference of the two samples’ averages. (In multivariate settings, however, as
demonstrated in Section III, entirely opposite results may obtain when matched vs. unmatched
analyses are employed.) What differs is the estimated standard error of the difference, and
corresponding p-values of statistical significance for the null hypothesis of no difference. As the
researcher is usually correct in his judgment that there is some benefit in matching, the matched
sample t-test will be more powerful, statistically. And to be clear, use of the unmatched t-test is
invalid. It is especially inappropriate to report an unmatched t-test when the researcher wishes,
for some reason, to show no statistical difference between two samples.
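The point that the two tests share a numerator and differ only in the standard error can be checked numerically. The following is a minimal sketch on simulated data; the sample size, the pair-level effect, and the noise levels are illustrative assumptions, not values from any study discussed here.

```python
import math
import random


def t_statistics(cases, controls):
    """Return (unmatched_t, matched_t) for two equal-length paired samples."""
    n = len(cases)
    mean_case = sum(cases) / n
    mean_ctrl = sum(controls) / n

    # Unmatched (two-sample) test: standard error ignores the pairing.
    var_case = sum((x - mean_case) ** 2 for x in cases) / (n - 1)
    var_ctrl = sum((y - mean_ctrl) ** 2 for y in controls) / (n - 1)
    se_unmatched = math.sqrt(var_case / n + var_ctrl / n)

    # Matched (one-sample) test on pair-wise differences.
    diffs = [x - y for x, y in zip(cases, controls)]
    mean_diff = sum(diffs) / n  # equals mean_case - mean_ctrl
    var_diff = sum((d - mean_diff) ** 2 for d in diffs) / (n - 1)
    se_matched = math.sqrt(var_diff / n)

    return (mean_case - mean_ctrl) / se_unmatched, mean_diff / se_matched


# Hypothetical data: a strong pair-level effect shared by case and control.
random.seed(0)
pair_effect = [random.gauss(0, 3) for _ in range(40)]
cases = [a + 0.5 + random.gauss(0, 1) for a in pair_effect]
controls = [a + random.gauss(0, 1) for a in pair_effect]

t_unmatched, t_matched = t_statistics(cases, controls)
print(f"unmatched t = {t_unmatched:.2f}, matched t = {t_matched:.2f}")
```

Both statistics carry the same sign because the numerators coincide; only the denominators differ.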
We observe, further, that the situations described in statistics textbooks differ from many
of the non-random observational study settings common in accounting research in which
matched sample t-tests are often applied. The NCB-FM-W research design can be an exception.
Accounting studies may differ from those described by Johnson and Bhattacharyya (1985) and
Rice (1995) in that matching is sometimes by closest size or other measure, and hence Error 2 is
possible. When matching is by industry and closest log-size, for example, we argue that a
univariate t-test, matched or not, is not even appropriate. The residual difference in log-size
might explain pair-wise difference in the measure of interest as well or better than the group
membership does, in truth, but that would not be revealed in a matched t-test. The corresponding
analysis that would avoid Error 2 is an OLS regression of differences, in particular with the pair-wise difference of the measure of interest regressed upon an intercept and pair-wise difference in
log-size (and perhaps more terms, such as pair-wise difference in log-size-squared). Significance
of the intercept would indicate a group membership effect on the measure of interest. (Or,
equivalently, the OLS regression to avoid Error 2 may be run on all non-differenced observations
but with the dependent variable being the measure of interest and independent variables being
a group membership dummy variable, the size variable or a size-difference variable, and pair-identifier dummy variables, with omission
of one such dummy variable or of an overall intercept.) A quadratic term for the size or the size-difference may also be accommodated in this formulation.
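A regression of differences of this kind might be sketched as follows. The function name, the simulated data, and the parameter values are hypothetical; the data are constructed so that the outcome depends on log-size only, with no true group effect, to show how the intercept isolates group membership from residual size differences.

```python
import numpy as np


def difference_regression(y_case, y_ctrl, logsize_case, logsize_ctrl):
    """OLS of pair-wise outcome differences on an intercept and the
    pair-wise log-size difference. Returns (coefficients, t_values)."""
    dy = np.asarray(y_case, float) - np.asarray(y_ctrl, float)
    ds = np.asarray(logsize_case, float) - np.asarray(logsize_ctrl, float)
    X = np.column_stack([np.ones_like(ds), ds])  # [intercept, size difference]
    beta, *_ = np.linalg.lstsq(X, dy, rcond=None)
    resid = dy - X @ beta
    dof = len(dy) - X.shape[1]
    sigma2 = resid @ resid / dof
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta, beta / np.sqrt(np.diag(cov))


# Hypothetical data: cases are matched to controls of similar, not equal, size.
rng = np.random.default_rng(1)
n = 60
logsize_ctrl = rng.normal(5.0, 1.0, n)
logsize_case = logsize_ctrl + rng.normal(0.2, 0.3, n)  # imperfect match
pair_effect = rng.normal(0.0, 2.0, n)
y_ctrl = pair_effect + 1.0 * logsize_ctrl + rng.normal(0.0, 0.05, n)
y_case = pair_effect + 1.0 * logsize_case + rng.normal(0.0, 0.05, n)  # no group effect

beta, t_values = difference_regression(y_case, y_ctrl, logsize_case, logsize_ctrl)
print("intercept (group effect):", beta[0], "slope on size difference:", beta[1])
```

In this construction the mean outcome difference is nonzero only because cases are slightly larger than their controls; the regression attributes it to the size difference and leaves the intercept near zero.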
Accounting studies also differ from those described in the statistics textbooks in that
there is a possibility of non-generalizability, that is, of Error 3. Except in the NCB-FM-W settings where
the samples arguably are randomly drawn, the application of a matched pair t-test towards
ascertaining a group difference, or of the OLS regression just described, does not yield any
generalizable result: there exists no larger population to which the analysis generalizes. In the
NCB-FM-W setting of before-and-after, assuming the cases are drawn randomly from all
continuing firms, the result generalizes to the population of all continuing firms. In the other five
settings, the match selection draws a non-random sample. The OLS implementation, but not the
univariate t-test, may be corrected by reweighting to avoid Error 3.
Guidance for reweighting
Reweighting provides for generalizability, and we explain how to apply it here.
However, this requires exogenous sampling information that will only be available if the
researcher focuses data collection effort, early in the research design process, on the wider
population to which analysis in a nonrandom sample is to generalize. To reweight a choice-based non-matched sample, the researcher must know the count of each outcome category in the
wider population. For example, to study bankruptcy across Compustat-listed firms by a choice-based sample of bankrupt and non-bankrupt firms, the researcher would effectively need to
determine whether or not the definition of bankruptcy is met, for all Compustat firms in any
strata that will be sampled. It does not suffice to examine the bankruptcy status for just the
sample firms. The researcher’s choice of definition for bankruptcy, then, is limited to those for
which information is universally available in the wider population to which the sample will
generalize. Likewise, to evaluate a nonchoice-based matched sample, the researcher must know
the count in the wider population of members of each possible matched set. For example, in an
industry matched sample, the number of Compustat firms in each industry category must be
known or collected, again limiting the researcher’s choice of definition of industries to one that
can be determined universally. For a sample that is both choice-based and matched, the
researcher must know the count in all strata formed by intersection of outcome partitioning and
matched set partitioning. For example, in each separate industry category, the number of
bankruptcies and non-bankruptcies must be collected. In the analysis, each observation is to be
weighted by the inverse of the sampling rate for its stratum.
Consider a CB-FM setting such as bankrupt firms matched to non-bankrupt firms by
industry, with random selection out of the multiple possible matches (rather than by closest size,
eliminating, for simplicity, any possibility of Error 2). To examine whether there is a difference
between bankrupt firms vs. non-bankrupt firms in a given accounting ratio, it would be incorrect
to apply a univariate t-test. It is possible, however, to perform a correct analysis by running a
weighted OLS regression of the accounting ratio upon a group membership dummy variable plus
pair-identifier dummy variables (omitting an overall intercept or one pair-identifier). The weight
applied to each observation should be inversely proportional to the sampling rate for its stratum.
For example, let the weight be 1 for all the bankruptcies in one industry where all bankruptcies
available are selected in the sample. Then, the weight should be 2 for bankruptcies in a second
industry where only one-half of all available bankruptcies are randomly selected. And, within
each of these industries, the randomly selected non-bankruptcies should be weighted according
to the prevalence of non-bankruptcies in the industry. In the second industry, supposing there are
99 non-bankruptcies for each bankruptcy, and just one non-bankruptcy is chosen randomly for
each bankruptcy in the sample, the weight for each non-bankruptcy would be 198. If performed
appropriately, the weighted regression results provide a generally valid result of the association
of a single accounting ratio with bankruptcy.24
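The weights in this example follow directly from stratum counts. A minimal sketch is below; the function name and the absolute counts (8 and 10 bankruptcies) are hypothetical, chosen only so that the sampling rates match those described above. Non-bankruptcies in the first industry are left out because the text gives no rate for them.

```python
def inverse_sampling_weights(population_counts, sample_counts):
    """Weight for each stratum = population count / sampled count,
    i.e. the inverse of the stratum's sampling rate."""
    weights = {}
    for stratum, sampled in sample_counts.items():
        if sampled <= 0:
            raise ValueError(f"no sampled observations in stratum {stratum!r}")
        weights[stratum] = population_counts[stratum] / sampled
    return weights


# Hypothetical counts consistent with the example in the text.
population = {
    ("industry_1", "bankrupt"): 8,        # all bankruptcies sampled
    ("industry_2", "bankrupt"): 10,       # one-half sampled
    ("industry_2", "non_bankrupt"): 990,  # 99 non-bankruptcies per bankruptcy
}
sample = {
    ("industry_1", "bankrupt"): 8,
    ("industry_2", "bankrupt"): 5,
    ("industry_2", "non_bankrupt"): 5,    # one control per sampled bankruptcy
}

weights = inverse_sampling_weights(population, sample)
for stratum, w in weights.items():
    print(stratum, w)
```

These weights would then be supplied to the weighted regression as observation weights.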
To be sure, in many situations the need for exogenous sampling information is onerous,
and, compared to collecting a random sample, the cost of collecting exogenous rate information
would often outweigh the advantage of being selectively strategic in taking a non-random sample
rather than a random one.25
24 Our characterization may be over-simplified. Greene (2000), summarizing the Weighted Exogenous Sampling
Maximum Likelihood (WESML) estimator applied to a CB-NM case, states that weights are to be applied within
a weighted log-likelihood expression, i.e. as constants multiplying the logs of each observation’s individual
likelihood. Our CB-SM example might better be implemented in maximum likelihood form, which may differ in
implementation from how we describe it above. And, Greene notes complications in the estimation of the
covariance matrix requiring use of a special estimator.
25 Also, Greene (2004) notes that use of a non-random sample and then reweighting is not a “free lunch”: “What the
biased sampling does, the weighting undoes. It is common for the end result to be very large standard errors,
which might be viewed as unfortunate, insofar as the purpose of the biased sampling was to balance the data
precisely to avoid this problem” (p. 823). We note that the unbalanced sampling in choice-based studies does at
least ensure that observations having each outcome are represented in the sample, which random sampling would
not guarantee. Choice-based and matched samples do permit investigation of areas where data collection is
costly.
Choice-Based Non-Matched
More specifically, let us summarize our guidance for researchers, by research
design category. See Table 4 for a summary. If research employs choice-based sampling,
without use of matching (CB-NM design), as 6 of the 73 audit research papers do, then of our
three errors only Error 3 can apply. Because the CB sample is non-random, either logit analysis
can be used, or reweighting is needed. Some might believe, incorrectly, that having the same
number of observations in each of two subsamples is needed, and discard valuable data from one
in order to even them up, but that is not necessary. If pair-wise or other matching is not
employed, it is not necessary to have equal sizes of case versus comparison samples, so all
available data should be used to enhance power in the analysis. (It is true that if additional
observations could be collected, it would generally be more beneficial to add data to the smaller
sized outcome group; in a narrow sense, equal sizing is most efficient. However data collection
decisions should be based on cost as well as benefit considerations, and it is not appropriate to
throw away available data for which no additional collection costs need be incurred,
unnecessarily.)
Choice-Based Fully Matched
It is more common now, however, for researchers using a choice-based design also to
choose to use matching (as Abdel-khalik and Ajinkya (1979) recommend). If research employs a
choice-based fully matched (CB-FM) design that yields equal-sized pair-matched comparisons, as
27 of 73 do, all three errors can apply. To utilize the matching and avoid Error 1, pair-wise
differences can be taken or, equivalently, an intercept for each pair can be included in analysis
(with omission for one pairing or omission of an overall intercept, to permit estimation). Of the
27, 23 do suffer from this error; the four that avoid it do so by employing only univariate
pairwise tests (and then they cannot avoid suffering Error 3). For multivariate OLS regressions,
including pair-identifier intercepts would be an easy adjustment to avoid Error 1 and to soak
up the variance in outcome that relates to the matching variables. Discriminant analysis of
matched sample data could be performed correctly by applying analysis to assess the location in
the independent variable space of pair-wise differences, and examining that for significant
deviation from the origin, as described earlier. Other statistical methods may or may not lend
themselves to controlling for matching that found locations of outcome groups in the space of the
independent variables. To avoid Error 2, which 21 of the 27 fail to do, researchers must include
the imperfectly matched variable in the analysis (e.g. include size or size-squared as a control variable in
a regression analysis). To avoid Error 3, either logit regression or reweighting needs to be
applied. Again, for discriminant analysis, we are not aware of any approach that could
implement the necessary reweighting.
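For logit analysis of one-to-one matched pairs, the conditional likelihood depends only on within-pair differences: the contribution of each pair is the logistic function of the coefficient vector applied to the case-minus-control difference. The sketch below fits that likelihood by Newton's method. It is an illustration under assumptions, not the procedure used in our replications; the function name, the simulated data, and the true coefficient of 1.0 are hypothetical.

```python
import numpy as np


def conditional_logit_pairs(x_case, x_ctrl, n_iter=25):
    """Conditional logit for 1:1 matched pairs.

    Maximizes sum_i log sigmoid(beta' d_i), where d_i is the
    case-minus-control covariate difference, by Newton's method.
    Pair-specific intercepts cancel and are never estimated."""
    d = np.asarray(x_case, float) - np.asarray(x_ctrl, float)
    if d.ndim == 1:
        d = d[:, None]
    beta = np.zeros(d.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(d @ beta)))        # P(observed member is the case)
        grad = d.T @ (1.0 - p)                        # score vector
        info = (d * (p * (1.0 - p))[:, None]).T @ d   # negative Hessian
        beta = beta + np.linalg.solve(info, grad)
    return beta


# Hypothetical data: within each pair, which member is the case depends on x.
rng = np.random.default_rng(0)
n_pairs = 2000
x1 = rng.normal(size=n_pairs)
x2 = rng.normal(size=n_pairs)
p_first = 1.0 / (1.0 + np.exp(-1.0 * (x1 - x2)))     # true coefficient 1.0
first_is_case = rng.random(n_pairs) < p_first
x_case = np.where(first_is_case, x1, x2)
x_ctrl = np.where(first_is_case, x2, x1)

beta_hat = conditional_logit_pairs(x_case, x_ctrl)
print("estimated coefficient:", beta_hat[0])
```

Imperfectly matched variables such as size can be added as further columns of the difference matrix, which addresses Error 2 within the same fit.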
Choice-Based Semi-Matched
If research employs a choice-based semi-matched (CB-SM) design, as 23 of 73 do, then
Error 1 and Error 3 may easily apply. Error 2 is conceivable, but not often seen. This is the
research design illustrated in simulations 1, 2, and 3 above. Unlike for the CB-FM design, here it
is not most appropriate to perform the analysis upon pair-wise differences, even when there is a
nominal pairing that could be used for that purpose, as in the simulation. (Note, for semi-matched analysis there does not need to be an equal number of each outcome collected in each
stratum. But if an equal number of outcome 1’s and outcome 0’s are collected in each group, as
was done in the collection of 50 pairs in the simulation, a researcher might be inclined to take
pairwise differences.) Instead, the most efficient way to control for matching is by including just
one stratum identifier for each defined stratum (e.g. industry), and hence to pool together the
nominal pairings that appear within each stratum. In the simulation, then, effectively, just five
industry intercepts are to be estimated, rather than 50 or 100 or 1600 pairwise intercepts, and
greater efficiency is achieved. In our review of audit papers, we identify as Semi-matched a
number of studies where the researchers themselves present the work as if it were Fully-matched,
because it was apparent to us that the researchers’ pairings included groups of pairs that could
best be pooled together, whether or not the researchers perceived it that way. Again, to achieve
generalizability of results and to avoid Error 3, logit regression must be employed in the analysis
or explicit reweighting is needed.
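The efficiency gain from pooling can be seen by counting parameters. The sketch below builds both sets of indicator columns for 50 pairs spread over five industries; the even allocation of ten pairs per industry and the helper name are assumptions for illustration.

```python
import numpy as np


def dummy_matrix(labels):
    """One indicator column per distinct label (no column dropped)."""
    labels = np.asarray(labels)
    levels = np.unique(labels)
    return (labels[:, None] == levels[None, :]).astype(float)


n_pairs = 50
pair_id = np.repeat(np.arange(n_pairs), 2)       # two observations per pair
industry_of_pair = np.repeat(np.arange(5), 10)   # hypothetical: 10 pairs per industry
industry = industry_of_pair[pair_id]

pair_dummies = dummy_matrix(pair_id)             # fully matched control
industry_dummies = dummy_matrix(industry)        # semi-matched control

# Residual degrees of freedom with one regressor of interest added.
n_obs = len(pair_id)
df_pairwise = n_obs - pair_dummies.shape[1] - 1
df_pooled = n_obs - industry_dummies.shape[1] - 1
print(pair_dummies.shape, industry_dummies.shape, df_pairwise, df_pooled)
```

With one regressor of interest, the pooled specification leaves 94 residual degrees of freedom against 49 for the pair-wise one.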
NonChoice-Based Fully Matched Between Subjects
If a researcher employs a nonchoice-based fully matched (NCB-FM) design, as 11 of the 73 audit
research papers do, then it is possible that the matching is between-subjects (NCB-FM-B, 6) or
within-subjects (NCB-FM-W, 5). If the former, all three errors could possibly apply. To avoid
Error 1, pairing must be accounted for by analysis based on differences or otherwise fully
saturated. To avoid Error 2, imperfection in matching must be controlled for by including terms
for the imperfectly matched variables in the analysis. To avoid Error 3, if logit regression is not
employed, reweighting must be applied.
NonChoice-Based Fully Matched Within Subjects
An example within-subjects experimental design would be a study of firms’ audit fees
compared before and after some event. Here, the sampling is not choice-based but it is matched
(e.g. a firm-year observation before an event, to the same firm’s firm-year observation after the
event). In within-subjects designs, if the subjects are themselves chosen randomly, then there is
no issue of non-random selection that would require reweighting to strata proportions. So within-subjects studies are not subject to Error 3. Since there may be sample selection issues that come
up in an audit fees within-subjects study (e.g. firms which undergo mergers may drop from the
audit fee study), a selection bias concern can rule out the widest generalizations, but the within-subjects sample results can fairly be generalized at least to the larger population of continuing
firms. Error 2 also is not possible; there is a perfect matching of each subject (before) with itself
(after). To avoid Error 1, pairing must be accounted for in the analysis.
NonChoice-Based Semi-Matched
If research employs nonchoice-based semi-matched (NCB-SM), as 6 of the 73 papers do,
analysis should proceed as with the NCB-FM-B, but for the use of fewer intercept dummies
(condensing pairs within the same stratum into pools, to have just one intercept estimated more
accurately).
Some Considerations in Match Selection
Better matching variables are those which are not of research interest but which are
believed to explain variation in outcome and which are cheap to gather, before data of final
samples is to be collected.
We observe many examples of matching by closest size, but where it is log-size that is
deemed the appropriate form of the variable to include in analysis (e.g. Lys & Watts 1994). It
would be more appropriate to use closest log-size in the match selection initially.
To perform the matching, on whatever matching variables have been chosen, use
software that provides an audit trail and implements a consistent approach. We recommend
Kosanke and Bergstrahl’s SAS macro for selecting matched sets, described in Bergstrahl,
Kosanke, and Jacobsen (1991). Parsons (2002) provides an alternative SAS macro that we have not
evaluated.
There are other perspectives, but we find Heckman, Ichimura and Todd (1998)’s
arguments against the use of propensity matching to be compelling.
Again, if analysis is to incorporate reweighting, match selection must be applied using
only variables for which the strata-level and universal rates are known.
Some Considerations in Choosing Research Designs
In advance of data collection, we recommend use of matched and choice-based sampling
plans only when there exist relatively inexpensive-to-gather candidate matching variables and
when other control variables and/or the variables of research interest are expensive to collect.
Then, a strategically selected matched sample can be more powerful than a similar sized random
sample. If data collection costs are not high, a random sample is preferred. If data collection
costs are intermediate, it is preferred to collect more than one match for each case observation;
the analysis as semi-matched rather than fully matched can easily accommodate the additional
information.
6. Summary and Conclusions
Technical errors in the analysis of non-random samples run through accounting research.
Controls for matching are not included, although needed (Error 1). As we showed in simulation,
incorrect conclusions can then be reached. A lesser error is failure to evaluate the potential effect
of imperfect matching (Error 2). Where the logit exemption from the need for reweighting does not
apply, then WESML or other reweighting is needed but typically is not applied (Error 3), so
presented results are not in fact generalizable.
Our main finding from our review of accounting research is that the vast majority of
choice-based and matched papers suffer one or more of the three technical deficiencies. Of the
73 audit research papers we reviewed, 55 need to but do not explicitly control for matching in
their analysis, and thus suffer from Error 1. Of the remainder, 6 are choice-based samples but not
matched samples (so control for matching is not needed). Only 12 of the 73 are matched samples
where researchers correctly controlled for the matching. Our most urgent guidance to researchers
then is to either avoid use of matching, or to take the matching into account when analyzing the
data. If matching is not taken into account, by either evaluating pairwise differences, or by
including dummy variables for each matched set, then the research should not be accepted by the
field.
On the second technical criticism we make, which can only apply to the 38 fully matched
research designs within the 73 audit research papers, we note 30 of 38 papers suffer from lack of
explicit control for “closest” imperfect matching. We advise researchers either to avoid imperfect
matching, or to perform and report sensitivity analyses on how imperfection in the matching
might have influenced outcomes. A closest-matched variable such as size can still have
influence. It might be controlled for by including a linear term. But, as size or another variable’s
contribution might be non-linear, in general, there is no fully satisfactory resolution. The
researcher, we argue, must make some effort to examine the possibility that all results are driven
by the omitted effect. Sensitivity analyses including linear and quadratic terms, for example,
might be performed and discussed. Otherwise, the researcher has not established that other
reported effects are not merely the result of an omitted variable problem.
Our third technical criticism, the need for reweighting, can apply to only 42 of the 73
papers. We note that 40 of these 42 are in error for performing statistical analysis without
necessary reweighting. A large fraction of the other papers, 25 of the 73, can be regarded as
largely exempt from the need to perform reweighting because they used logit regression. When
logit is used, inferences based on non-intercept coefficients can be correct, assuming there are
not other methodological problems present. (However, of these 25 papers, only the 4 that
avoided use of matching do not themselves suffer from Error 1.) Only two of the papers applied
an explicit reweighting. To avoid this criticism, we suggest that choice-based and matched
sampling should be avoided unless explicit sampling rate information can be obtained (allowing
for explicit reweighting) or unless logit regression will suffice to analyze the research questions
(taking advantage of the logit exemption to the need for reweighting).
How important are the errors we identify for the course of accounting research? It is
possible that research streams have been misdirected due to mistaken identification of effects
that are not true, or due to the mistaken findings that are reversed from true effects. We suspect
that the most common effect may be the failure to find effects that in fact appear to be true, as
illustrated in our replication of Ghicas (1990).
Burgstahler (1987) argues that if tests reported in the accounting literature are
characterized by low power (as would be the case for misanalyzed choice-based and matched
samples) and high effective levels (i.e. if there is a bias to publish significant findings), the
results of published tests properly should have little or no impact on the beliefs of a Bayesian. It
is common knowledge that papers with no result (finding insufficient evidence to reject null
hypotheses) are much less likely to be published; Greenwald (1975) discusses such publication
bias and its consequences. This supports a severe view that over a very long period, accounting
research involving misanalysed non-random samples should be disregarded.
We hope that researchers recognize new opportunities for research projects from this
work. First, there are many opportunities to reconsider published results in audit research and
other areas where choice-based and matched samples have been used. Many researchers might
now salvage studies unpublished previously for reason of unexplainable anomalies or for lack of
statistically significant results. And future work may now exploit the greater-than-previously-understood power of choice-based and matched sampling methods, when correctly analysed, in
appropriate settings.
References
Abdel-khalik, A. R. and B. B. Ajinkya. 1979. Empirical Research in Accounting: A Methodological Viewpoint. Sarasota, FL: American Accounting
Association.
Abrevaya, J. 1996. The Equivalence of Two Estimators of the Fixed-Effects Logit Model. Economics Letters 55:41-43.
Agresti, A. 2002. Categorical Data Analysis. 2nd edition. Hoboken, NJ: John Wiley & Sons, Inc.
Altman, E. I. 1968. Financial Ratios as Predictors of Failure. Journal of Finance 23(4): 589-609.
Anderson, J. A. 1972. Separate Sample Logistic Discrimination. Biometrika, 59:19-35.
Barber, B. M. and J. D. Lyon. 1986. Detecting Abnormal Operating Performance: The Empirical Power and Specification of Test Statistics. Journal
of Financial Economics 41(3): 359-399.
Barber, B. M. and J. D. Lyon. 1987. Detecting Long-run Abnormal Stock Returns: The Empirical Power and Specification of Test Statistics. Journal
of Financial Economics 43(3): 341-372.
Bartov, E., F. A. Gul, and J. S. L. Tsui. 2001. Discretionary-Accruals Models and Audit Qualifications. Journal of Accounting and Economics 30:
421-452.
Beaver, W. H. 1966. Financial Ratios as Predictors of Failure. Empirical Research in Accounting: Selected Studies, 1966, supplement to Journal of
Accounting Research: 71-111.
Beaver, W. H. 1968. Market Prices, Financial Ratios, and the Prediction of Failure. Journal of Accounting Research 6(2): 179-192.
Bhojraj, S. and C. M. C. Lee. 2002. Who is my peer? A valuation-based approach to the selection of comparable firms. Journal of Accounting
Research 40: 407-439.
Breslow, N. E. 1996. Statistics in Epidemiology: The Case-Control Study. Journal of the American Statistical Association, 91(433):14-28.
Breslow, N. E. and N. E. Day. 1980. Statistical Methods in Cancer Research: Volume 1--The Analysis of Case-Control Studies. The
International Agency for Research on Cancer, Lyon, France.
Breslow, N. E. and N. E. Day. 1987. Statistical Methods in Cancer Research: Volume II—The Design and Analysis of Cohort Studies. The
International Agency for Research on Cancer, Lyon, France.
Burgstahler, D. 1987. Inference from Empirical Research. The Accounting Review 62 (1): 203-214.
Burgstahler, D., J. Jiambalvo, & E. Noreen. 1989. Changes in the Probability of Bankruptcy and Equity Value. Journal of Accounting & Economics
11 (2,3): 207-224.
Campbell, D. T. and J. C. Stanley. 1966. Experimental and Quasi-Experimental Designs for Research. Chicago: Rand McNally & Company.
Deakin, E. B. 1972. A Discriminant Analysis of Predictors of Business Failure. Journal of Accounting Research, Spring: 167-179.
Dietrich, J. 2001. The Effects of Choice-Based Sampling and Small-sample Bias on Past Fair Lending Exams. Working paper, Office of The
Comptroller of The Currency, Department of The Treasury, Washington, DC.
Dopuch, N., R. W. Holthausen, and R. W. Leftwich. 1987. Predicting Audit Qualifications with Financial and Market Variables. The Accounting
Review LXII (3): 431-454.
Ghicas, D. 1990. Determinants of Actuarial Cost Method Changes for Pension Accounting and Funding. The Accounting Review. April 384-405.
Giles, J. A., and M. J. Courchane. 2000. Stratified Sampling Design for Fair Lending Binary Logit Models. Working paper.
Greenwald, A. G. 1975. Consequences of Prejudice Against the Null Hypothesis. Psychological Bulletin 82(1): 1-20.
Harrison, T. 1977. Different Market Reactions to Discretionary and Nondiscretionary Accounting Changes. Journal of Accounting Research 15(1):
84-107.
Heckman, J. J. 2002. Unpublished notes.
Heckman, J. J., H. Ichimura, and P. Todd. 1998. Matching as an Econometric Evaluation Estimator. The Review of Economic Studies 65(2): 261-294.
Hillegeist, S., E. K. Keating, D. P. Cram, and K. G. Lundstedt. 2004. Assessing the Probability of Bankruptcy. Review of Accounting Studies 9: 5-34.
Hosmer, D. W. and S. Lemeshow. 1988. Applied Logistic Regression. NY: John Wiley & Sons.
Johnson and Bhattacharyya. 1985. Statistics: Principles and Methods. NY: John Wiley & Sons.
Kerlinger. 1973. [cited in Abdel-Khalik and Ajinkya, details to be added]
Kinney, W. R. 1986. Empirical Accounting Research Design for Ph.D Students. The Accounting Review 61(2): 338-350.
Kothari, S. P., A. J. Leone and C. E. Wasley. 2005. Performance Matched Discretionary Accrual Measures. Journal of Accounting and Economics,
39:1.
Lys, T. and R. Watts. 1994. Lawsuits Against Auditors. Journal of Accounting Research 32 (Supplement): 65-93.
Mack, T. M., M. C. Pike, B. E. Henderson, R. I. Pfeffer, V. R. Gerkins, M. Arthur, and S. E. Brown, 1976. Estrogens and Endometrial Cancer in a
Retirement Community. The New England Journal of Medicine. 294: 1262-1267.
Maddala, G. S. 1991. A Perspective on the Use of Limited-Dependent and Qualitative Variables Models in Accounting Research. The Accounting
Review 66 (4): 788-807.
Maddala, G. S. 1996. Applications of Limited Dependent Variable Models in Finance. Handbook of Statistics, 14: 553-566.
Maddala, G. S. 1983. Limited-Dependent and Qualitative Variables in Econometrics. Cambridge University Press.
Manski, C. F. and S. R. Lerman. 1977. The Estimation of Choice Probabilities from Choice Based Samples. Econometrica 45 (November): 1977-88.
Manski, C. F. and D. McFadden. 1981. Alternative Estimators and Sample Designs for Discrete Choice Analysis, in: C. F. Manski and D. McFadden,
eds., Structural Analysis of Discrete Data with Econometric Applications. MIT Press, Cambridge, MA.
McNichols, M. and A. Dravid. 1990. Stock dividends, Stock Splits, and Signalling. Journal of Finance 45 (July): 857-79.
Palepu, K. G. 1986. Predicting Takeover Targets: a Methodological and Empirical Analysis. Journal of Accounting and Economics 8: 3-35.
Parsons, L. S. 2002. Reducing Bias in a Propensity Score Matched-Pair Sample Using Greedy Matching Techniques. SAS Users’ Group Conference
Proceedings. Paper 214-26.
Prentice, R. L. and R. Pyke. 1979. Logistic Disease Incidence Models and Case-Control Studies. Biometrika 66(3): 403-11.
Rice, J. A. 1995. Mathematical Statistics and Data Analysis. 2nd edition. Belmont, Ca.: Duxbury Press, International Thomson Publishing.
SAS Version 9.1, 2003. Software Documentation, SAS/STAT PROC LOGISTIC, Example 42.10.
Schlesselman, J. J. 1982. Case-Control Studies: Design, Conduct, Analysis. Oxford University Press, Oxford.
Scott and Wild. 1991. Fitting Logistic Models in Stratified Case-Control Studies. Biometrics 47: 497-510.
Smith, M. 2003. Research Methods in Accounting. Sage Publications.
Tatsuoka, M.M. 1971. Multivariate Analysis: Techniques for Educational and Psychological Research. New York: John Wiley & Sons, Inc.
Wallace, J. S. 1997. Adopting Residual Income-Based Compensation Plans: Do You Get What You Pay For? Journal of Accounting and
Economics. 24: 275-300.
Zmijewski, M. 1984. Methodological Issues Related to the Estimation of Financial Distress Prediction Models. Journal of Accounting Research 22:
59-82.
Figure 1: Research Design Categories for Choice Based and Matched Samples

[Diagram: overlapping regions labelled "Choice Based", "Matched", and "Fully Matched", divided into the design categories CB-NM, CB-SM, CB-FM, NCB-SM, NCB-FM-B, and NCB-FM-W.]

Category   Description                                        Example                 Count in Category,
                                                                                      Audit Research
CB-NM      Choice Based Non-Matched                           Palepu (1986)           5
CB-SM      Choice Based Semi-Matched                          Henninger (2001)        23
CB-FM      Choice Based Fully Matched                         Lys and Watts (1994)    27
NCB-FM-W   Non Choice Based Fully Matched, Within-Subject     Teoh and Wong (1993)    6
NCB-FM-B   Non Choice Based Fully Matched, Between Subject    Iyer and Iyer (1996)    7
NCB-SM     Non Choice Based Semi-Matched                      Krishnan (2003)         5
Total                                                                                 73
Table 1: Comparison of Methods Applied to Simulated Data

Simulation 1: True β1 = 1, β2 = 1, correlation = .4
        50 pr      100 pr     200 pr     400 pr     800 pr     1600 pr    All Data
Unmatched Analysis
  β̂1    -0.083     -0.004     0.045      0.062**    0.083***   0.077***   0.086***
  β̂2    1.562***   1.441***   1.208***   1.324***   1.250***   1.137***   1.204***
Conditional Logit Using Pairings
  β̂1    0.886**    0.833***   0.956***   1.159***   1.032***   0.989***   N.A.
  β̂2    1.898***   1.480***   1.131***   1.348***   1.191***   .966***    N.A.
Conditional Logit Using "Industry" Groupings
  β̂1    0.668**    0.802***   0.906***   0.967***   1.004***   0.967***   0.977***
  β̂2    1.622***   1.413***   1.106***   1.209***   1.096***   0.949***   1.019***

Simulation 2: True β1 = 1, β2 = 0, correlation = .4
        50 pr      100 pr     200 pr     400 pr     800 pr     1600 pr    All Data
Unmatched Analysis
  β̂1    0.075      0.081      0.089**    0.093***   0.090***   0.082***   0.087***
  β̂2    0.131      0.424***   0.423***   0.337***   0.318***   0.328***   0.303***
Conditional Logit Using Pairings
  β̂1    0.772***   0.865***   0.959***   0.926***   0.972***   0.988***   N.A.
  β̂2    -0.117     0.150      0.106      0.051      0.059      0.079*     N.A.
Conditional Logit Using "Industry" Groupings
  β̂1    0.854***   0.857***   0.944***   0.910***   0.935***   0.947***   0.986***
  β̂2    -0.109     0.193      0.127      0.065      0.050      0.062      -0.002

Simulation 3: True β1 = -4, β2 = 1, correlation = .4
        50 pr      100 pr     200 pr     400 pr     800 pr     1600 pr    All Data
Unmatched Analysis
  β̂1    -0.164**   -0.151***  -0.162***  -0.164***  -0.160***  -0.157***  -0.149***
  β̂2    -0.207     -0.328**   -0.198**   -0.175**   -0.217***  -0.254***  -0.180***
Conditional Logit Using Pairings
  β̂1    -4.749***  -4.877***  -4.485***  -3.829***  -4.021***  -4.192***  N.A.
  β̂2    0.932      0.801*     0.810***   0.917***   0.869***   0.865***   N.A.
Conditional Logit Using "Industry" Groupings
  β̂1    -5.945***  -4.070***  -4.161***  -4.129***  -3.926***  -4.013***  -3.946***
  β̂2    0.641      0.529*     0.880***   1.025***   0.902***   0.905***   0.996***

*,**,*** = significant at .10, .05, .01 level

[Graphical Presentation of Final Coefficient Estimates, Simulated Data in Five Industry Groupings: one plot per simulation, with axes Est Beta1 and Est Beta2, marking the true values and the final estimates from the unmatched analysis, from strata on pairs, and with industry fixed effects. Simulation 1: True (1,1), Unmatched (.09, 1.2). Simulation 2: True (1,0), Unmatched (.09, .3). Simulation 3: True (-4,1), Unmatched (-.15, -.2).]
Table 2: Logit Regressions Analyzed in Mack et al. (1976) Replication

            Panel A                  Panel B                  Panel C
            SAS Reported             Replication:             Replication:
            Results                  Conditional Logit        Unmatched Analysis
            Coefficient   P-value    Coefficient   P-value    Coefficient   P-value
Intercept                                                     -0.3013       0.2242
Gall        .9704         0.0675     .9704         0.0675     0.8417        0.0770
Hyper       .3481         0.3558     .3481         0.3558     0.3682        0.3264

            Odds Ratio               Odds Ratio               Odds Ratio
Gall        2.639                    2.639                    2.258
Hyper       1.416                    1.416                    1.445

            low           high       low           high       low           high
Gall        .933          7.468      .933          7.468      .915          5.571
Hyper       .677          2.965      .677          2.965      .693          3.015

N           63 pairs                 63 pairs                 126

Panel A: SAS Version 9.1 (2003) Example 42.10: Conditional Logistic Regression for Matched Pairs
Data.
Panel B: Data analyzed using conditional logit; the results match SAS Example 42.10 of Panel A exactly.
Panel C: Reanalysis using unmatched analysis. The apparent finding is that the Gall variable is
significant and Hyper is not, as before. However, significance is slightly lower, and the Odds Ratio
for Gall is estimated at 2.258, 40% lower.
Gall = Gall bladder condition
Hyper = Hypertension condition
Bold = significant at .10 level
Table 3: Logit Regressions Analyzed in Ghicas (1990) Replication

            Panel A                  Panel B                  Panel C
            Ghicas' Reported         Replication:             Replication:
            Results                  Unmatched Analysis       Conditional Logit
            Coefficient   P-value    Coefficient   P-value    Coefficient   P-value
Intercept   -2.880        0.31       0.713         0.76
PR          1.036         0.06       0.210         0.39       0.250         0.35
IR          0.785         0.01       0.484         0.02       0.735         0.00
LEV         3.632         0.12       1.698         0.40       1.716         0.53
WC          -1.788        0.03       -1.188        0.10       -0.847        0.28
RUNI        -31.547       0.01       -21.983       0.06       -4.528        0.61
INT         14.43         0.01       9.376         0.09       1.090         0.74
ln (TA)     -0.845        0.11       -0.378        0.08       -0.726        0.01
ETR         1.625         0.20       0.098         0.69       0.160         0.77
CI          -0.338        0.46       0.195         0.20       0.189         0.09
FFO         3.109         0.34       0.752         0.70       0.679         0.75

N           90                       86                       86

Panel A: Ghicas (1990), Table 4, p. 397.
Panel B: We reconstructed Ghicas' data to the extent possible, yielding 43 usable pairs. The unmatched
logit regression yields substantially similar results.
Panel C: Data analyzed using conditional logit shows changes in results. Results for variable IR were now found to
be highly significant at a .001 level. Results for variables WC, RUNI, and INT now show no significant effects.
Variable LOGTA, a proxy for political visibility, and central to Ghicas' hypotheses, was not found by Ghicas to be
significant. However, conditional logit regression shows this variable to be significant at a .02 level.
PR = Pension Assets / Pension Liabilities
IR = Interest Rate, used for the computation of pension liabilities per SFAS No. 36
Bold = significant at .10 level
Variable definitions, with numbers indicating Compustat item numbers: LEV = Long-term Debt, #9 / (Total Assets,
#6 - Intangible Assets, #33); WC = Current Assets, #4 / Current Liabilities, #5; RUNI = (Capital Expenditures, #128
+ Acquisitions, #129 + Advertising, #45 + R&D, #46) / Total Assets, #6; INT = WC * RUNI; TA = Total Assets,
#6 (in millions); ETR = (Tax Expense, #16 - Change in Deferred Taxes, #35) / Funds Flow from Operations, #35;
CI = (It - It-1) / It-1, where It = Income for year t before extraordinary items and discontinued operations, #18;
FFO = Funds Flow from Operations, #110 / Sales, #12.
Table 4: Description of Research Design and Potential Errors

For each design, the entries under Error 1, Error 2, and Error 3 show whether the error can arise (√) and the count of papers in the audit research sample suffering it.

CB-FM (27 papers in audit research sample)
  Treatment group: Selected on basis of outcome.
  Control group: One firm selected as match for each firm in treatment group from set of firms having similar characteristics by matching on "closest" values.
  Error 1: √ 23     Error 2: √ 21     Error 3: √ 18, 26*
  Selected guidance: If logit, run conditional logit with pairs identified as strata. If OLS, include dummy variables for pairs, and reweight each observation for its sampling rate (i.e., apply WESML).

CB-SM (23 papers)
  Treatment group: Selected on basis of outcome.
  Control group: Randomly selected from firms not having the same outcome, but matching by industry, year, size or group level.
  Error 1: √ 21     Error 2: √ 0      Error 3: √ 8, 22*
  Selected guidance: If logit, run conditional logit with groups identified as strata. If OLS, include dummy variables for groups and apply WESML.

CB-NM (5 papers)
  Treatment group: Selected on basis of outcome.
  Control group: Randomly selected from firms not having the same outcome.
  Error 1: N/A      Error 2: N/A      Error 3: √ 4
  Selected guidance: If logit, run regular logit, and only the intercept is biased. If not logit, apply WESML.

NCB-FM-W (6 papers)
  Treatment group: Randomly selected sample of firms.
  Control group: Same subject, usually before and after.
  Error 1: √ 2      Error 2: N/A      Error 3: N/A
  Selected guidance: If OLS, include pair-identifier dummies or analyse as differences-on-differences. If MANOVA, block on subject. Univariate comparisons okay. WESML not required.

NCB-FM-B (7 papers)
  Treatment group: Randomly selected sample of firms.
  Control group: One firm selected as match for each firm in treatment group from set of firms having similar characteristics by matching on "closest" values.
  Error 1: √ 4      Error 2: √ 6      Error 3: √ 7
  Selected guidance: If OLS, include pair-identifier dummies and linear (and perhaps more) terms for imperfectly matched variables, or analyse as differences-on-differences including differences of imperfectly matched variables. WESML required.

NCB-SM (5 papers)
  Treatment group: Randomly selected sample of firms.
  Control group: Randomly selected from firms not having the same outcome, but matching by industry, year, size or group level.
  Error 1: √ 5      Error 2: √ 0      Error 3: √ 5
  Selected guidance: If OLS, include group dummies. If MANOVA, block on groups. WESML required.

Total: 73 papers
  Error 1: 55       Error 2: 27       Error 3: 42, 64*

Error 1: Count of audit papers that use unconditional analysis, when analysis conditional upon effects of matching variables is needed
Error 2: Count of audit papers that fail to control for effect of imperfectly matched variables
Error 3: Count of audit papers that fail to reweight observations according to appropriate sampling rates
*For Error 3, the unstarred count does not include logit papers with unsaturated models; the starred count includes logit papers with unsaturated
models.
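The WESML reweighting named in the guidance above can be sketched as follows. This is a minimal illustration assuming the weights of Manski and Lerman (1977), i.e. the population share of each outcome divided by its sample share; the function names, the 2% population rate, and the toy data are hypothetical and are not taken from any study reviewed here.

```python
import math


def wesml_weights(population_share, sample_share):
    """Weight for each outcome: population share divided by sample share."""
    return {d: population_share[d] / sample_share[d] for d in population_share}


def weighted_logit_loglik(alpha, beta, data, weights):
    """WESML objective for a one-regressor logit: each observation's
    log-likelihood term is multiplied by the weight for its outcome."""
    total = 0.0
    for x, d in data:
        p = 1.0 / (1.0 + math.exp(-(alpha + beta * x)))
        total += weights[d] * math.log(p if d == 1 else 1.0 - p)
    return total


# Hypothetical setting: the outcome occurs in 2% of the population,
# but a one-to-one choice-based sample contains 50% of each outcome.
population_share = {1: 0.02, 0: 0.98}
sample_share = {1: 0.50, 0: 0.50}
weights = wesml_weights(population_share, sample_share)

# After weighting, the sample reproduces the population outcome share.
reweighted_share = (sample_share[1] * weights[1]) / (
    sample_share[1] * weights[1] + sample_share[0] * weights[0]
)

toy_data = [(0.2, 1), (-1.0, 0), (1.5, 1), (0.3, 0)]  # (x, outcome) pairs
objective = weighted_logit_loglik(0.0, 1.0, toy_data, weights)
print(weights, reweighted_share, objective)
```

The objective would be maximized over alpha and beta by any standard optimizer; the same weights could in principle be passed to a weighted regression routine when the analysis is not a logit.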
Appendix A
This appendix provides proof that coefficient estimates in a logit regression on one-to-one
matched pairs data are correctly analysed either by a no-intercept logit regression on pair-wise
differences, or, equivalently, by a pooled no-intercept logit regression having dummy
variables indicating pair memberships. It follows that an unmatched pooled logit regression (as
has been routinely employed in practice) is misspecified (see note 26). Specifically, the proof
establishes that the relative magnitudes of coefficients are estimated correctly each way. The
corresponding standard errors and p-values, however, are estimated correctly only by the
former approach; proof of this is not provided herein. (See Abrevaya (1996) for explication of
this complication.)
Note 26: The proof follows Agresti (2002)'s notation and suggestions for extension from a simpler
setting that he presents. The basic result for coefficients is attributed to Anderson (1972) and
extended by Prentice and Pyke (1979) and others; see Breslow (1996) for a review.

Suppose that a population of data exists where the following logistic model relationship
is true:

\[
y = 1 \ \text{if} \ \alpha_j + x\beta + \varepsilon > 0; \qquad y = 0 \ \text{otherwise}
\]

where $\alpha_j$ is an intercept specific to the $j$th of $J$ strata in the population, $x\beta$ is the vector product
of coefficients and independent variables, and $\varepsilon$ is a logistic distributed error term having mean
0 and variance $\sigma^2$. The $j$th stratum is a subpopulation having uniform measurements on
industry and size or other combination of factors that influence the outcome through $\alpha_j$. From
the population, suppose that $n$ matched pairs of data are randomly selected without
replacement, i.e., where each pair of observations is matched only in that both are selected
from within the same stratum. Note, the selection is not outcome-based; both outcomes in a
pair might be the same. Denote the sample data as follows:

\[
\{(y_{11}, x_{11}, y_{12}, x_{12}),\ (y_{21}, x_{21}, y_{22}, x_{22}),\ \ldots,\ (y_{n1}, x_{n1}, y_{n2}, x_{n2})\}
\]

where $y_{i1}$ and $y_{i2}$ denote the paired 0-1 outcomes for the $i$th matched pair and vectors $x_{i1}$ and $x_{i2}$
denote the corresponding sets of values of the explanatory variable(s). The likelihood function
for the sample is as follows, from which the proof will follow directly:

\[
L^* = \prod_{i=1}^{n} \left[
\left( \frac{\exp(\alpha_i + x_{i1}\beta)}{1 + \exp(\alpha_i + x_{i1}\beta)} \right)^{y_{i1}}
\left( \frac{1}{1 + \exp(\alpha_i + x_{i1}\beta)} \right)^{1 - y_{i1}}
\left( \frac{\exp(\alpha_i + x_{i2}\beta)}{1 + \exp(\alpha_i + x_{i2}\beta)} \right)^{y_{i2}}
\left( \frac{1}{1 + \exp(\alpha_i + x_{i2}\beta)} \right)^{1 - y_{i2}}
\right]
\tag{A1}
\]
where $L^*$ denotes the likelihood of the sample, and the $i$th expression in square brackets
expresses the probability that the $i$th pair would have the outcomes that are observed for it.
The likelihood expression provides for intercepts $\alpha_i$, $i = 1$ to $n$, and coefficient vector $\beta$, and may
be maximized to yield maximum likelihood estimates for these parameters directly using
iterative search methods. Note this is in the form of the estimation of a pooled logit regression
with an intercept/dummy variable for each pair, and hence we have shown that logit regression
with dummy variables is appropriate for the assumed matched sample setting. Maximizing
the expression [A1] will yield coefficient estimates that are correct in their relative magnitudes;
the scaling in logit software implementations is arbitrarily chosen by normalizing the scale
of the unobservable $\varepsilon$.
As the purpose of estimation is to determine $\beta$, it is useful to note that there is no
information about $\beta$ available in pairs where both outcomes are 1 or both outcomes are 0,
because the distribution of $(y_{i1}, x_{i1}, y_{i2}, x_{i2})$ depends on $\beta$ only when the pairwise success total
$S_i \equiv y_{i1} + y_{i2}$ equals one. If $S_i = 0$ or $S_i = 2$, the value of $\alpha_i$ may be set arbitrarily large and
negative (for $S_i = 0$) or arbitrarily large and positive (for $S_i = 2$), so that the $i$th pair's
contribution to the expression above is arbitrarily close to one; hence varying $\hat{\beta}$ in maximum
likelihood estimation searching will not affect that pair's contribution to the likelihood.
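A numerical illustration of this point may help. The sketch below is not part of the proof: it assumes a single scalar explanatory variable, and the function names and numeric values are hypothetical. It evaluates one pair's factor in the likelihood [A1] for several values of the coefficient, first for same-outcome pairs with an extreme pair intercept and then for an opposing-outcome pair.

```python
import math


def logistic(t):
    return 1.0 / (1.0 + math.exp(-t))


def pair_contribution(alpha, beta, x1, x2, y1, y2):
    """One pair's factor in the likelihood [A1], for scalar x."""
    p1 = logistic(alpha + x1 * beta)
    p2 = logistic(alpha + x2 * beta)
    return (p1 if y1 else 1.0 - p1) * (p2 if y2 else 1.0 - p2)


betas = [-2.0, 0.0, 2.0]
x1, x2 = 0.5, -1.0  # hypothetical explanatory values for the two pair members

# Both outcomes 1: a large positive intercept drives the factor to one.
both_ones = [pair_contribution(30.0, b, x1, x2, 1, 1) for b in betas]
# Both outcomes 0: a large negative intercept does the same.
both_zeros = [pair_contribution(-30.0, b, x1, x2, 0, 0) for b in betas]
# Opposing outcomes: the factor depends on beta.
opposing = [pair_contribution(0.0, b, x1, x2, 0, 1) for b in betas]

print(both_ones, both_zeros, opposing)
```

Whatever coefficient value is tried, the same-outcome pairs contribute a factor indistinguishable from one, so only the opposing-outcome pairs inform the estimate.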
Restricting ourselves then to pairs having opposing outcomes, i.e. where $S_i = 1$, we can
write:

\[
P(Y_{i1} = 0, Y_{i2} = 1 \mid S_i = 1) + P(Y_{i1} = 1, Y_{i2} = 0 \mid S_i = 1) = 1
\]

and

\[
P(Y_{i1} = y_{i1}, Y_{i2} = y_{i2} \mid S_i = 1) =
\frac{P(Y_{i1} = y_{i1}, Y_{i2} = y_{i2})}
{P(Y_{i1} = 0, Y_{i2} = 1) + P(Y_{i1} = 1, Y_{i2} = 0)}
\]

Expanding the last expression out to reflect the influence of independent variables, using the
usual logit formulae, we obtain:

\[
P(Y_{i1} = y_{i1}, Y_{i2} = y_{i2} \mid S_i = 1) =
\frac{
\left( \dfrac{\exp(\alpha_i + x_{i1}\beta)}{1 + \exp(\alpha_i + x_{i1}\beta)} \right)^{y_{i1}}
\left( \dfrac{1}{1 + \exp(\alpha_i + x_{i1}\beta)} \right)^{1 - y_{i1}}
\left( \dfrac{\exp(\alpha_i + x_{i2}\beta)}{1 + \exp(\alpha_i + x_{i2}\beta)} \right)^{y_{i2}}
\left( \dfrac{1}{1 + \exp(\alpha_i + x_{i2}\beta)} \right)^{1 - y_{i2}}
}{
\dfrac{1}{1 + \exp(\alpha_i + x_{i1}\beta)} \cdot \dfrac{\exp(\alpha_i + x_{i2}\beta)}{1 + \exp(\alpha_i + x_{i2}\beta)}
+
\dfrac{\exp(\alpha_i + x_{i1}\beta)}{1 + \exp(\alpha_i + x_{i1}\beta)} \cdot \dfrac{1}{1 + \exp(\alpha_i + x_{i2}\beta)}
}
\tag{A2}
\]

The expression is the probability that the sample outcomes would be observed, given that
opposing outcomes are observed. Without loss of generality, we can reorder within any pairs
where necessary so that it is the first outcome in the pair that is zero, i.e. so
that $y_{i1} = 0$ and $y_{i2} = 1$. Then [A2] above simplifies to:

\[
P(Y_{i1} = y_{i1}, Y_{i2} = y_{i2} \mid S_i = 1) =
\frac{
\dfrac{1}{1 + \exp(\alpha_i + x_{i1}\beta)} \cdot \dfrac{\exp(\alpha_i + x_{i2}\beta)}{1 + \exp(\alpha_i + x_{i2}\beta)}
}{
\dfrac{1}{1 + \exp(\alpha_i + x_{i1}\beta)} \cdot \dfrac{\exp(\alpha_i + x_{i2}\beta)}{1 + \exp(\alpha_i + x_{i2}\beta)}
+
\dfrac{\exp(\alpha_i + x_{i1}\beta)}{1 + \exp(\alpha_i + x_{i1}\beta)} \cdot \dfrac{1}{1 + \exp(\alpha_i + x_{i2}\beta)}
}
\]

And the above simplifies to:

\[
= \frac{\dfrac{\exp(\alpha_i + x_{i2}\beta)}{\exp(\alpha_i + x_{i1}\beta)}}
{1 + \dfrac{\exp(\alpha_i + x_{i2}\beta)}{\exp(\alpha_i + x_{i1}\beta)}}
= \frac{\exp\big((\alpha_i + x_{i2}\beta) - (\alpha_i + x_{i1}\beta)\big)}
{1 + \exp\big((\alpha_i + x_{i2}\beta) - (\alpha_i + x_{i1}\beta)\big)}
\tag{A3}
\]

Now, in one more algebraic step we see that the pair-identifier intercepts can be dropped out,
as this further simplifies to:

\[
= \frac{\exp\big((x_{i2} - x_{i1})\beta\big)}{1 + \exp\big((x_{i2} - x_{i1})\beta\big)}
\tag{A4}
\]
Note, this is in the form of a logistic regression across pairs $i$, with no intercept and with
predictor values $x_i^* = x_{i2} - x_{i1}$, and artificial response $y_i^* = 1$ for every observation. Thus, we
have proven the equivalence between the no-intercept logit regression of pair-wise differences
in outcome (all 1's) on differences in explanatory variables, and the pooled logit regression
including an intercept/dummy variable for each pair, because as noted above the estimation
may be performed directly on [A1]. Maximizing [A4] yields coefficient estimates that are the
same in relative magnitudes as maximization of [A1]. In practice, however, software
implementation with dummy variables as in [A1] will yield coefficient estimates that are twice
as large as in implementing [A4], and will report corresponding standard errors and p-values that
are incorrect. (Again, Abrevaya (1996) provides explanation.) The correct software
implementation is applied by SAS software's PROC LOGISTIC with use of its STRATA
statement, or by Stata software's CLOGIT command.
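The result can be checked numerically. The following is a minimal sketch rather than a replacement for the packaged routines named above: it assumes a single scalar explanatory variable, a true coefficient of 1, and pair intercepts drawn uniformly between -3 and 3, all of which are hypothetical choices, as are the function names. It first confirms that the conditional probability in [A2] does not depend on the pair intercept and equals [A4], and then estimates the coefficient by maximizing the product of [A4] terms over opposing-outcome pairs with Newton's method.

```python
import math
import random


def logistic(t):
    return 1.0 / (1.0 + math.exp(-t))


def conditional_prob(alpha, x1, x2, beta):
    """[A2] with y_i1 = 0 and y_i2 = 1, computed with the pair intercept."""
    p1 = logistic(alpha + x1 * beta)
    p2 = logistic(alpha + x2 * beta)
    return (1.0 - p1) * p2 / ((1.0 - p1) * p2 + p1 * (1.0 - p2))


def fit_difference_logit(diffs, iterations=25):
    """No-intercept logit of an all-ones response on pair differences,
    i.e. maximization of the product of [A4] terms (scalar beta)."""
    beta = 0.0
    for _ in range(iterations):
        score = sum(d * (1.0 - logistic(d * beta)) for d in diffs)
        info = sum(d * d * logistic(d * beta) * (1.0 - logistic(d * beta))
                   for d in diffs)
        beta += score / info  # Newton step on the concave log-likelihood
    return beta


# The pair intercept cancels: [A2] equals [A4] for any alpha.
a4 = logistic((1.1 - 0.3) * 0.7)
gap_low = abs(conditional_prob(-2.0, 0.3, 1.1, 0.7) - a4)
gap_high = abs(conditional_prob(5.0, 0.3, 1.1, 0.7) - a4)

# Simulated matched pairs with widely varying pair intercepts.
rng = random.Random(0)
true_beta = 1.0
diffs = []
for _ in range(4000):
    alpha = rng.uniform(-3.0, 3.0)
    x1, x2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    y1 = rng.random() < logistic(alpha + x1 * true_beta)
    y2 = rng.random() < logistic(alpha + x2 * true_beta)
    if y1 != y2:  # keep opposing-outcome pairs only
        # x for the member with outcome 1 minus x for the member with outcome 0
        diffs.append(x2 - x1 if y2 else x1 - x2)

beta_hat = fit_difference_logit(diffs)
print(len(diffs), beta_hat)
```

Because the intercepts never enter the differenced likelihood, the estimate should land near the assumed true coefficient even though the intercepts differ greatly across pairs.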
Appendix B
This appendix provides proof that choice-based matched sampling in general requires modification
of the usual estimation methods, but that for the binary, ordered, or multinomial logit
regression setting the estimation may be performed by the usual logit regression, provided a fully
saturated model is employed (see note 27).
Note 27: This understanding is due to Scott and Wild (1991). This proof is informed by unpublished class lecture notes by
Heckman (2002) that addressed a simpler case.

First, let us consider random sampling and thereafter how non-random sampling and
analysis differ. The likelihood function in a random sampling scheme is:

\[
L = \prod_{i=1}^{I} f(Y_i, X_i, Z_i) = \prod_{i=1}^{I} f(Y_i \mid X_i, Z_i, \beta)\, h(X_i, Z_i)
\]

where $Y$ is the dependent variable, $X$ is research variables, and $Z$ is nuisance variables, and $f$
and $h$ are joint density functions. Or in logarithm form

\[
\ln L = \sum_{i=1}^{I} \ln f(Y_i \mid X_i, Z_i, \beta) + \sum_{i=1}^{I} \ln h(X_i, Z_i)
\tag{B1}
\]

First order conditions for estimation are found by differentiating with respect to $\beta$ and setting
the result equal to zero:

\[
\frac{\partial \ln L}{\partial \beta} = \sum_{i=1}^{I} \frac{\partial \ln f(Y_i \mid X_i, Z_i, \beta)}{\partial \beta} = 0
\tag{B2}
\]

Note that in [B2] the second summation term in [B1] has dropped out, due to exogeneity, given
random sampling.
In the case of binary logistic regression, where $Y$ is constrained to 1's and 0's and $f$ is
the logistic density function, this simplifies to:

\[
\frac{\partial \ln L}{\partial \beta} = \sum_{i=1}^{I} \left( y_i - \frac{\exp(x_i \beta)}{1 + \exp(x_i \beta)} \right) x_i = 0
\tag{B3}
\]

It may be instructive to observe that, for $x_i$ including a constant, [B3] implies that the average
of predicted probabilities must equal the proportion of 1's in the sample.
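This implication of the first order conditions can be verified on simulated data. The sketch below is illustrative only: the data-generating values, the seed, and the function names are hypothetical, and the two-parameter Newton iteration is a bare-bones stand-in for any standard logit routine.

```python
import math
import random


def logistic(t):
    return 1.0 / (1.0 + math.exp(-t))


def fit_logit(xs, ys, iterations=30):
    """Maximum likelihood logit with a constant and one regressor,
    by Newton's method on the first order conditions."""
    a = b = 0.0
    for _ in range(iterations):
        p = [logistic(a + b * x) for x in xs]
        # Score vector: sum of (y - p) times (1, x).
        g0 = sum(y - pi for y, pi in zip(ys, p))
        g1 = sum((y - pi) * x for x, y, pi in zip(xs, ys, p))
        # Information matrix entries.
        h00 = sum(pi * (1.0 - pi) for pi in p)
        h01 = sum(pi * (1.0 - pi) * x for x, pi in zip(xs, p))
        h11 = sum(pi * (1.0 - pi) * x * x for x, pi in zip(xs, p))
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


rng = random.Random(1)
xs = [rng.gauss(0.0, 1.0) for _ in range(500)]
ys = [1 if rng.random() < logistic(-0.5 + 1.0 * x) else 0 for x in xs]

a_hat, b_hat = fit_logit(xs, ys)
mean_predicted = sum(logistic(a_hat + b_hat * x) for x in xs) / len(xs)
share_of_ones = sum(ys) / len(ys)
print(mean_predicted, share_of_ones)
```

The equality comes from the component of the first order conditions that corresponds to the constant term, which sets the sum of outcomes minus predicted probabilities to zero.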
For an endogenous sampling scheme, instead, such as for choice-based sampling, the
likelihood function is:

\[
L = \prod_{i=1}^{I} f(Y_i \mid X_i, Z_i, \beta)\, h(X_i, Z_i)\, \frac{C(Y_i)}{g(Y_i, Z_i)}
\]

where $g(Y_i, Z_i)$, abbreviated $g(Y_i)$ below, is the sampling rate for the given outcome $y_i$. In logarithm form,

\[
\ln L = \sum_{i=1}^{I} \ln f(Y_i \mid X_i, Z_i, \beta) + \sum_{i=1}^{I} \ln h(X_i, Z_i)
+ \sum_{i=1}^{I} \ln C(Y_i) - \sum_{i=1}^{I} \ln g(Y_i)
\tag{B4}
\]

and the first order conditions are

\[
\frac{\partial \ln L}{\partial \beta} = \sum_{i=1}^{I} \frac{\partial \ln f(Y_i \mid X_i, Z_i, \beta)}{\partial \beta}
- \sum_{i=1}^{I} \frac{\partial \ln g(Y_i)}{\partial \beta} = 0
\tag{B5}
\]
Note, estimators using just the first term in [B5], as is done for random sampling, in general
will be biased under choice-based sampling. It would be possible under any maximum
likelihood estimation approach to use explicit sampling rates as here. We will deduce that in
the case of binary logistic regression on a matched sample, however, the second term is zero
for each coefficient other than intercepts in a fully saturated model, i.e. for clusters in
semi-matched samples or for pairs in fully-matched samples. If logistic regression is not used, then
the estimation must incorporate the weighting as here.
In a choice-based matched sample, data are not randomly sampled but rather are drawn by the
following scheme:
1. Draw choice $D = d$ and industry $Z = z$ by $\varphi(d, z)$.
2. Draw $X$ by $f(X \mid d, z)$.
The joint density of the sampled data is then:

\[
f^*(X, d, z) = \varphi(d, z)\, f(X \mid d, z)
\tag{B6}
\]

Suppose that outcomes $d$ range from 0 to $I$. Suppose industries $z$ range from 1 to $J$. The
observed sample distribution of $X$ is:

\[
g^*(X) = f(X \mid d = 0, z = 1)\, \varphi(d = 0, z = 1) + \cdots + f(X \mid d = I, z = J)\, \varphi(d = I, z = J)
\]

and the probability in the sample of observing $d$ given $X$ is:

\[
\Pr{}^*(D = d, Z = z \mid X) = \frac{f(X \mid D = d, Z = z)\, \Pr(D = d, Z = z)}{f(X)}
\tag{B7}
\]

Assume $f(X) > 0$. Using Bayes' theorem, write, for a fixed $z$,

\[
\Pr{}^*(D = 1, Z = z \mid X) =
\frac{
\dfrac{\Pr(D = 1, Z = z \mid X)\, f(X)\, \varphi(D = 1, Z = z)}{\Pr(D = 1, Z = z)}
}{
\dfrac{\Pr(D = 1, Z = z \mid X)\, f(X)\, \varphi(D = 1, Z = z)}{\Pr(D = 1, Z = z)}
+
\dfrac{\Pr(D = 0, Z = z \mid X)\, f(X)\, \varphi(D = 0, Z = z)}{\Pr(D = 0, Z = z)}
}
\]
Cancelling $f(X)$'s, and multiplying through by $\Pr(D = 1, Z = z) / \varphi(D = 1, Z = z)$, this reduces to:

\[
\Pr{}^*(D = 1, Z = z \mid X) =
\frac{\Pr(D = 1, Z = z \mid X)}
{\Pr(D = 1, Z = z \mid X) + \Pr(D = 0, Z = z \mid X)\,
\dfrac{\Pr(D = 1, Z = z)\, \varphi(D = 0, Z = z)}{\Pr(D = 0, Z = z)\, \varphi(D = 1, Z = z)}}
\]

Now divide above and below by $\Pr(D = 1, Z = z \mid X)$ to yield:

\[
\Pr{}^*(D = 1, Z = z \mid X) =
\frac{1}
{1 + \dfrac{\Pr(D = 0, Z = z \mid X)}{\Pr(D = 1, Z = z \mid X)} \cdot
\dfrac{\Pr(D = 1, Z = z)\, \varphi(D = 0, Z = z)}{\Pr(D = 0, Z = z)\, \varphi(D = 1, Z = z)}}
\tag{B8}
\]

Recall, the log-odds form of the logit model resembles part of the above. In logit regression,

\[
\ln \frac{\Pr(D = 1, Z = z \mid X)}{\Pr(D = 0, Z = z \mid X)} = \alpha_z + x\beta
\]

which implies

\[
\frac{\Pr(D = 0, Z = z \mid X)}{\Pr(D = 1, Z = z \mid X)} = e^{-(\alpha_z + x\beta)}
\]
So, if the logit model is true, then [B8] becomes:

\[
\Pr{}^*(D = 1, Z = z \mid X)
= \frac{1}{1 + e^{-(\alpha_z + x\beta)}\,
e^{\ln\left\{ \frac{\Pr(D = 1, Z = z)\, \varphi(D = 0, Z = z)}{\Pr(D = 0, Z = z)\, \varphi(D = 1, Z = z)} \right\}}}
= \frac{1}{1 + e^{-\left( \alpha_z + x\beta - \ln\left\{ \frac{\Pr(D = 1, Z = z)\, \varphi(D = 0, Z = z)}{\Pr(D = 0, Z = z)\, \varphi(D = 1, Z = z)} \right\} \right)}}
= \frac{e^{\alpha_z^* + x\beta}}{1 + e^{\alpha_z^* + x\beta}}
\tag{B9}
\]

where

\[
\alpha_z^* = \alpha_z - \ln\left\{ \frac{\Pr(D = 1, Z = z)\, \varphi(D = 0, Z = z)}{\Pr(D = 0, Z = z)\, \varphi(D = 1, Z = z)} \right\}.
\]

Observe this is in the form of the usual logit
estimator. Applying the usual logit estimator within this cluster $z$, then, we get an unbiased
estimate of $\beta$ and only the intercept $\alpha_z$ estimated for this cluster is not correct. Pooling across
all clusters, provided we include an intercept for each cluster, $\beta$ will be estimated correctly.
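The intercept shift can be confirmed with a small deterministic calculation for a single cluster. The sketch below is an illustration under assumed values: the population logit parameters, the four-point distribution of a scalar X, the equal sampling fractions, and all names are hypothetical. It computes the in-sample probability of D = 1 directly from Bayes' theorem and compares it with a logit that keeps the same slope and uses the shifted intercept.

```python
import math


def logistic(t):
    return 1.0 / (1.0 + math.exp(-t))


alpha_z, beta = -3.0, 1.0          # population logit parameters (assumed)
x_values = [-1.0, 0.0, 1.0, 2.0]   # support of X within the cluster (assumed)
x_probs = [0.2, 0.4, 0.3, 0.1]     # population distribution of X (assumed)
phi = {1: 0.5, 0: 0.5}             # sampling fractions: one control per case

# Population outcome shares within the cluster.
pop = {1: sum(h * logistic(alpha_z + x * beta)
              for x, h in zip(x_values, x_probs))}
pop[0] = 1.0 - pop[1]


def sample_prob(x, h):
    """Probability of D = 1 given X = x in the choice-based sample."""
    p = logistic(alpha_z + x * beta)
    f_x_given_1 = p * h / pop[1]          # f(X | D = 1), by Bayes' theorem
    f_x_given_0 = (1.0 - p) * h / pop[0]  # f(X | D = 0)
    return f_x_given_1 * phi[1] / (f_x_given_1 * phi[1] + f_x_given_0 * phi[0])


# Shifted intercept from the derivation above; the slope is unchanged.
alpha_star = alpha_z - math.log((pop[1] * phi[0]) / (pop[0] * phi[1]))

gaps = [abs(sample_prob(x, h) - logistic(alpha_star + x * beta))
        for x, h in zip(x_values, x_probs)]
print(alpha_z, alpha_star, max(gaps))
```

In this assumed setting the outcome is rare in the population but makes up half of the sample, so the shifted intercept is larger than the population intercept while the slope is untouched.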
Any cluster where all outcomes d are the same will contribute nothing to the estimation
of β ; note, the intercept for the cluster can be set arbitrarily high or low so that the cluster
likelihood approaches 1 and is unaffected by β . So clusters where all outcomes are the same
may be deleted from the sample estimation.
Therefore we have derived that estimation using usual logit estimators works correctly,
except for the intercepts, so we have deduced the claim following (B5) above. We have not
proven that the standard errors in estimation will be estimated correctly using the usual logit
estimators, but that can be shown as well. The proof extends naturally to the ordered logit
setting and to the multinomial logit settings.
Note, also, at an extreme where each cluster consists merely of a pair of observations
having different outcomes, we have matched pairs. As Breslow (1996) observes, the matched
pair result can be reached as the extreme of the stratification process in which each stratum
consists of just one pair.