How Much Can We Generalize from Impact Evaluations? Are They Worthwhile?
Eva Vivalt∗
Stanford University
November 8, 2015
Abstract
Impact evaluations aim to predict the future, but they are rooted in particular
contexts and to what extent they generalize is an open and important question. I
exploit a new data set of impact evaluation results on a wide variety of interventions in development to answer this and other questions. I find that while a good
model and the separation of sampling variance help, results remain much more
heterogeneous than in other fields, such as medicine. Given the heterogeneity, an
obvious question is: are impact evaluations worthwhile? I model policymakers’
decisions and answer this question for different parameter values. I find that if
policymakers were to use the simplest, naive predictor of a program’s effects, they
would typically be off by 97%. Finally, I show how researchers can estimate the
generalizability of their own study using their own data, even when data from no
comparable studies exist.
∗ E-mail: [email protected]. I thank Edward Miguel, Bill Easterly, David Card, Ernesto Dal Bó, Hunt
Allcott, Elizabeth Tipton, David McKenzie, Vinci Chow, Willa Friedman, Xing Huang, Michaela Pagel,
Steven Pennings, Edson Severnini, seminar participants at the University of California, Berkeley, Columbia
University, New York University, the World Bank, Cornell University, Princeton University, the University
of Toronto, the London School of Economics, the Australian National University, the University of Ottawa,
and the Stockholm School of Economics, among others, and participants at the 2015 ASSA meeting and 2013
Association for Public Policy Analysis and Management Fall Research Conference for helpful comments. I
am also grateful for the hard work put in by many at AidGrade over the duration of this project, including
but not limited to Jeff Qiu, Bobbie Macdonald, Diana Stanescu, Cesar Augusto Lopez, Mi Shen, Ning Zhang,
Jennifer Ambrose, Naomi Crowther, Timothy Catlett, Joohee Kim, Gautam Bastian, Christine Shen, Taha
Jalil, Risa Santoso and Catherine Razeto.
1 Introduction
In the last few years, rigorous impact evaluations have become extensively used in development economics research. Policymakers and donors typically fund impact evaluations
precisely to figure out how effective a similar program would be in the future to guide their
decisions on what course of action they should take. However, it is not yet clear how much we
can extrapolate from past results or under which conditions. Some have argued that impact
evaluation results are context-dependent in a way that prevents them from being informative
(Deaton, 2011; Pritchett and Sandefur, 2013). Further, there is some evidence that even a
similar program, in a similar environment, can yield different results. For example, Bold et al.
(2013) carry out an impact evaluation of a program to provide contract teachers in Kenya;
this was a scaled-up version of an earlier program studied by Duflo, Dupas and Kremer
(2012). The earlier intervention studied by Duflo, Dupas and Kremer was implemented by
an NGO, while Bold et al. compared implementation by an NGO and the government’s attempted scale-up. While Duflo, Dupas and Kremer found positive effects, Bold et al. showed
significant results only for the NGO-implemented group. The different findings in the same
country for purportedly similar programs point to the substantial context-dependence of
impact evaluation results. Knowing the extent of this context-dependence is crucial in order
to understand what we can learn from any impact evaluation, and this paper provides that
information for a wide variety of interventions.
This paper also answers a second, related question: what is the value of an impact evaluation in terms of allowing policymakers to make better decisions? If the point of impact evaluations is to provide information, their use is naturally constrained by the extent to which
one can generalize from their findings. I model the policymaker's decision problem and compare the potential costs of an impact evaluation with the benefits of improved predictive
ability leading to better policy decisions.
While the main reasons to examine generalizability are to aid interpretation and improve
predictions, it would also help to direct research attention to where it is most needed. If
generalizability were higher in some areas, fewer papers would be needed to understand how
people would behave in a similar situation; conversely, if there were topics or regions where
generalizability was low, it would call for further study. With more information, researchers
can better calibrate where to direct their attention to generate new insights.
Impact evaluations are still increasing exponentially in number and in the resources devoted to them. The World Bank recently received a major grant from the UK aid agency DFID to expand its already large impact evaluation work; the Millennium Challenge Corporation has committed to conduct rigorous impact evaluations for 50% of its activities, with "some form of credible evaluation of impact" for every activity (Millennium Challenge Corporation, 2009); and the U.S. Agency for International Development is also increasingly invested in impact evaluations, coming out with a new policy in 2011 that directs 3% of program funds to evaluation.1

Figure 1: Growth of Impact Evaluations
This figure plots the number of studies that came out in each year that are contained in each of three databases described in the text: 3ie's title/abstract/keyword database of impact evaluations; J-PAL's database of affiliated randomized controlled trials; and AidGrade's database of impact evaluation results data.
Yet while impact evaluations are still growing in development, a few thousand are already complete. Figure 1 plots the explosion of RCTs that researchers affiliated with J-PAL,
a center for development economics research, have completed each year; alongside is the
number of development-related impact evaluations released that year according to 3ie, which
keeps a directory of titles, abstracts, and other basic information on impact evaluations more
broadly, including quasi-experimental designs; finally, the dashed line shows the number of
papers that came out in each year that are included in AidGrade’s database of impact evaluation results, which will be described shortly.
In short, while we do impact evaluations to figure out what will happen in the future,
many issues have been raised about how well we can extrapolate from past impact evaluations and, despite the importance of the topic, we were previously able to do little more
than guess or examine the question in narrow settings as we did not have the data. Now we
have the opportunity to address speculation, drawing on a large, unique dataset of impact
1 While most of these are less rigorous "performance evaluations", country mission leaders are supposed
to identify at least one opportunity for impact evaluation for each development objective in their 3-5 year
plans (USAID, 2011).
evaluation results.
I founded a non-profit organization dedicated to gathering these data. That organization, AidGrade, seeks to systematically understand which programs work best where, a task
that requires also knowing the limits of our knowledge. To date, AidGrade has conducted
20 meta-analyses and systematic reviews of different development programs.2 Data gathered through meta-analyses are the ideal data to answer the question of how much we can
extrapolate from past results, and since data on these 20 topics were collected in the same
way, coding the same outcomes and other variables, we can look across different types of
programs to see if there are any more general trends. Currently, the data set contains 647 papers on 210 narrowly-defined intervention-outcome combinations, with the greater database
containing 15,021 estimates.
A further contribution of this paper is the development of benchmarks or rules of thumb
that researchers or practitioners can use to gauge the external validity of their own work.
I discuss several metrics and show typical values across a range of interventions. Other
disciplines have considered generalizability more, so I draw on the literature relating to
meta-analysis, which has been most well-developed in medicine, as well as the psychometric
literature on generalizability theory (Higgins and Thompson, 2002; Shavelson and Webb,
2006; Briggs and Wilson, 2007).
Since we may not care about heterogeneity in treatment effects if it can be modelled, I
show how the measures I discuss could also be used in conjunction with explanatory models.
I use the concrete examples of conditional cash transfers (CCTs) and deworming programs,
which are relatively well-understood and on which many papers have been written, to elucidate the issues.
Ultimately, I find that while a good model and the separation of sampling variance help,
results remain much more heterogeneous than in other fields, such as medicine. Policy decisions would seem to be improved by careful research designs that pay attention to the issue
of external validity since, as it stands, the typical prediction of the effect of a program might
be off by 97%.
Though this paper focuses on results for impact evaluations of development programs,
development is only one of the first areas within economics to which these kinds of methods can be
applied. Many impact evaluations also exist for domestic policies, for instance. In many of
the sciences, knowledge is built through a combination of researchers conducting individual
studies and other researchers synthesizing the evidence through meta-analysis. This paper
begins that natural next step.
2 Throughout, I will refer to all 20 as meta-analyses, but some did not have enough comparable outcomes
for meta-analysis and became systematic reviews.
2 Theory

2.1 The Policymaker's Decision Problem

In the simplest case, a policymaker might face a choice between a program and an outside option. The program's effect if it were to be implemented in the policymaker's setting, θ_i, is unknown ex ante; the outside option's effect is θ*. The policymaker's prior is:

\theta_i \sim N(\mu, \tau^2)    (1)

where μ and τ² are unknown hyperparameters.
The policymaker has the opportunity to observe a signal about the effect of the program by conducting an impact evaluation. For example, this could be thought of as an evaluation of a pilot before rolling out a program naturally. The signal is itself drawn from a distribution:

Y_i \mid \theta_i \sim N(\theta_i, \sigma_i^2)    (2)

where Y_i is the observed effect size of a particular study and σ_i² the sample variance.
The impact evaluation has a cost, c > 0, and the policymaker needs to decide whether the value of the information provided by the signal is worth it. I assume the policymaker uses Bayesian updating. θ_i can then neatly be estimated using Bayesian meta-analysis.
As a quick review, the meta-analysis literature suggests two general types of models that
can be parameterized in many ways: fixed-effect models and random-effects models.
Fixed-effect models assume there is one true effect of a particular program and all differences between studies can be attributed simply to sampling error. In other words:

Y_i = \theta + \varepsilon_i    (3)

where θ is the true effect and ε_i is the error term.
Random-effects models do not make this assumption; the true effect could potentially vary from context to context. Here,

Y_i = \theta_i + \varepsilon_i    (4)
    = \bar{\theta} + \eta_i + \varepsilon_i    (5)

where θ̄ is the mean true effect size, η_i is a particular study's divergence from that mean true effect size, and ε_i is the error. Random-effects models are more plausible and they are necessary if we think there are heterogeneous treatment effects, so I use them in this paper.
Random-effects models can also be modified by the addition of explanatory variables, at
which point they are called mixed models; I will also use mixed models in this paper.
I begin by presenting the random-effects model, followed by the related strategy to estimate a mixed model. The random-effects model is fully described in Gelman et al. (2013),
from which the next section heavily draws.
2.1.1 Estimating a Random-Effects Model
To build a hierarchical Bayesian random-effects model, I first assume the data are normally distributed:

Y_{ij} \mid \theta_i \sim N(\theta_i, \sigma^2)    (6)

where j indexes the individuals in the study. I do not have individual-level data, but instead can use sufficient statistics:

Y_i \mid \theta_i \sim N(\theta_i, \sigma_i^2)    (7)

where Y_i is the sample mean and σ_i² the sample variance. This provides the likelihood for θ_i. I also need a prior for θ_i. As discussed, I assume between-study normality:

\theta_i \sim N(\mu, \tau^2)    (8)

where μ and τ are unknown hyperparameters.
Conditioning on the distribution of the data, given by Equation 7, I get a posterior:

\theta_i \mid \mu, \tau, Y \sim N(\hat{\theta}_i, V_i)    (9)

where

\hat{\theta}_i = \frac{\frac{Y_i}{\sigma_i^2} + \frac{\mu}{\tau^2}}{\frac{1}{\sigma_i^2} + \frac{1}{\tau^2}}, \qquad V_i = \frac{1}{\frac{1}{\sigma_i^2} + \frac{1}{\tau^2}}    (10)
I then need to pin down μ|τ and τ by constructing their posterior distributions given non-informative priors and updating based on the data. I assume a uniform prior for μ|τ, and as the Y_i are estimates of μ with variance (σ_i² + τ²), obtain:

\mu \mid \tau, Y \sim N(\hat{\mu}, V_\mu)    (11)

where

\hat{\mu} = \frac{\sum_i \frac{Y_i}{\sigma_i^2 + \tau^2}}{\sum_i \frac{1}{\sigma_i^2 + \tau^2}}, \qquad V_\mu = \frac{1}{\sum_i \frac{1}{\sigma_i^2 + \tau^2}}    (12)
For τ, note that p(τ | Y) = p(μ, τ | Y) / p(μ | τ, Y). The denominator follows from Equation 12; for the numerator, we can observe that p(μ, τ | Y) is proportional to p(μ, τ)p(Y | μ, τ), and we know the marginal distribution of Y_i | μ, τ:

Y_i \mid \mu, \tau \sim N(\mu, \sigma_i^2 + \tau^2)    (13)

I use a uniform prior for τ, following Gelman et al. (2013). This yields the posterior for the numerator:

p(\mu, \tau \mid Y) \propto p(\mu, \tau) \prod_i N(Y_i \mid \mu, \sigma_i^2 + \tau^2)    (14)

Putting together all the pieces in reverse order, I first generate p(τ | Y) and simulate τ from it, then μ, and finally θ_i.
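To make this estimation strategy concrete, the following sketch (Python with NumPy; my own illustration rather than the paper's or AidGrade's code, with function and variable names chosen here for exposition) implements the grid-based simulation just described: compute p(τ | Y) on a grid under uniform priors, draw τ, then μ | τ, Y, and finally θ_i | μ, τ, Y, following Equations 9-14.

import numpy as np

def random_effects_meta(y, sigma2, n_draws=5000, seed=0):
    # Hierarchical Bayesian random-effects meta-analysis in the spirit of
    # Gelman et al. (2013), ch. 5: uniform priors on mu and tau, grid-based
    # simulation of p(tau | Y), then mu | tau, Y, then theta_i | mu, tau, Y.
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    sigma2 = np.asarray(sigma2, dtype=float)
    tau_grid = np.linspace(1e-4, 2.0 * y.std() + 1e-3, 400)

    # Marginal posterior of tau on the grid.
    log_p = np.empty_like(tau_grid)
    for k, tau in enumerate(tau_grid):
        v = sigma2 + tau ** 2
        V_mu = 1.0 / np.sum(1.0 / v)
        mu_hat = V_mu * np.sum(y / v)
        log_p[k] = (0.5 * np.log(V_mu)
                    - 0.5 * np.sum(np.log(v))
                    - 0.5 * np.sum((y - mu_hat) ** 2 / v))
    p = np.exp(log_p - log_p.max())
    p /= p.sum()

    # Simulate tau, then mu | tau, then theta_i | mu, tau (Equations 9-12).
    tau = rng.choice(tau_grid, size=n_draws, p=p)
    v = sigma2[None, :] + tau[:, None] ** 2
    V_mu = 1.0 / np.sum(1.0 / v, axis=1)
    mu = rng.normal(V_mu * np.sum(y[None, :] / v, axis=1), np.sqrt(V_mu))
    prec = 1.0 / sigma2[None, :] + 1.0 / tau[:, None] ** 2
    theta_hat = (y[None, :] / sigma2[None, :] + mu[:, None] / tau[:, None] ** 2) / prec
    theta = rng.normal(theta_hat, np.sqrt(1.0 / prec))
    return tau, mu, theta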
2.1.2 Estimating a Mixed Model
The strategy here is similar. Appendix E contains a derivation.
2.2 Solving the Policymaker's Problem and Extensions
The random-effects and mixed models would both yield a predicted θ_i, with or without the use of explanatory variables. We can imagine that with infinite data, θ̂_i → θ_i. The model also enables a back-of-the-envelope calculation of how much it might cost a policymaker to make the "wrong" decision, which would let us say for what set of conditions it might be useful to spend money on an additional impact evaluation. Remembering that the outside option's effect is θ*, if θ̂_i > θ* and θ_i < θ*, or vice versa, the policymaker is making a mistake costing the value of |θ_i - θ*|. We can either think of a function that assigns a cost to each θ_i or keep all calculations in terms of θ_i.
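As an illustration of this back-of-the-envelope calculation, the following sketch (again Python; the names and numbers are hypothetical, not results from the paper) takes posterior draws of θ_i, for instance from the sketch in Section 2.1.1, and computes the probability that the policymaker's decision relative to θ* is a mistake, along with the expected cost |θ_i - θ*| of that mistake.

import numpy as np

def expected_mistake_cost(theta_draws, theta_star):
    # Posterior draws of theta_i stand in for the policymaker's uncertainty.
    # The policymaker adopts the program when the point estimate exceeds the
    # outside option; a "mistake" is adopting when theta_i < theta_star, or
    # passing when theta_i > theta_star, costing |theta_i - theta_star|.
    theta_draws = np.asarray(theta_draws, dtype=float)
    adopt = theta_draws.mean() > theta_star
    mistaken = theta_draws < theta_star if adopt else theta_draws > theta_star
    # Expected cost counts zero whenever no mistake is made.
    cost = np.where(mistaken, np.abs(theta_draws - theta_star), 0.0)
    return mistaken.mean(), cost.mean()

# Example with hypothetical numbers: posterior centred near 0.12 SD,
# outside option worth 0.1 SD.
draws = np.random.default_rng(1).normal(0.12, 0.15, size=10_000)
print(expected_mistake_cost(draws, 0.1))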
This exercise is limited to the costs and benefits a policymaker would face if trying to find
the most efficient program with which to achieve a particular policy goal and does not take
into consideration other goals a policymaker might have, such as political concerns. While
thus limited, we might think that a policymaker that seeks to make the best policy decisions
represents the ideal social planner.
Though this model was framed in terms of comparing a program against an outside option
for the sake of simplicity, we might think that the outside option could itself be uncertain.
It might be more realistic to think of the policymaker as choosing between two uncertain
programs. This would be straightforward to accommodate in the model: we would only have
to consider the difference in effects between two programs as the θi that is being estimated.
Finally, an impact evaluation can also be thought of as providing a public good that
multiple policymakers could take advantage of; in this case, an impact evaluation could still
be worthwhile as a public good, even in some cases in which it would not be worthwhile as
a private good. I will consider this possibility in the discussion of the results.
3 Data
This paper uses a database of impact evaluation results collected by AidGrade, a U.S.
non-profit research institute that I founded in 2012. AidGrade focuses on gathering the
results of impact evaluations and analyzing the data, including through meta-analysis. Its
data on impact evaluation results were collected in the course of its meta-analyses from
2012-2014 (AidGrade, 2015).
AidGrade’s meta-analyses follow the standard stages: (1) topic selection; (2) a search
for relevant papers; (3) screening of papers; (4) data extraction; and (5) data analysis. In
addition, it pays attention to (6) dissemination and (7) updating of results. Here, I will
discuss the selection of papers (stages 1-3) and the data extraction protocol (stage 4); more
detail is provided in Appendix B.
3.1 Selection of Papers
Interventions were selected for meta-analysis largely on the basis of there being a sufficient number of studies on that topic. Five AidGrade staff members each
independently made a preliminary list of interventions for examination; the lists were then
combined and searches done for each topic to determine if there were likely to be enough
impact evaluations for a meta-analysis. The remaining list was voted on by the general
public online and partially randomized. Appendix B provides further detail.
A comprehensive literature search was done using a mix of the search aggregators SciVerse, Google Scholar, and EBSCO/PubMed. The online databases of J-PAL, IPA, CEGA
and 3ie were also searched for completeness. Finally, the references of any existing systematic reviews or meta-analyses were collected.
Any impact evaluation which appeared to be on the intervention in question was included,
barring those in developed countries.3 Any paper that tried to consider the counterfactual
was considered an impact evaluation. Both published papers and working papers were included. The search and screening criteria were deliberately broad. There is not enough
room to include the full text of the search terms and inclusion criteria for all 20 topics in
this paper, but these are available in an online appendix as detailed in Appendix A.
3 High-income countries, according to the World Bank's classification system.
3.2 Data Extraction
The subset of the data on which I am focusing is based on those papers that passed all
screening stages in the meta-analyses. Again, the search and screening criteria were very
broad and, after passing the full text screening, the vast majority of papers that were later
excluded were excluded merely because they had no outcome variables in common or did
not provide adequate data (for example, not providing data that could be used to calculate
the standard error of an estimate, or for a variety of other quirky reasons, such as displaying
results only graphically). The small overlap of outcome variables is a surprising and notable
feature of the data. Ultimately, the data I draw upon for this paper consist of 15,021 results
(double-coded and then reconciled by a third researcher) across 647 papers covering the 20
types of development program listed in Table 1.4 For the sake of comparison, though the two organizations clearly do different things, at the time of writing this is more impact evaluations than J-PAL has published, concentrated in these 20 topics. Unfortunately, only 318 of these papers both overlapped in outcomes with another paper and were able to be standardized and could thus be included in the main results, which rely on intervention-outcome groups.
Outcomes were defined under several rules of varying specificity, as will be discussed shortly.
Table 1: List of Development Programs Covered

2012                             2013
Conditional cash transfers       Contract teachers
Deworming                        Financial literacy training
Improved stoves                  HIV education
Insecticide-treated bed nets     Irrigation
Microfinance                     Micro health insurance
Safe water storage               Micronutrient supplementation
Scholarships                     Mobile phone-based reminders
School meals                     Performance pay
Unconditional cash transfers     Rural electrification
Water treatment                  Women's empowerment programs
73 variables were coded for each paper. Additional topic-specific variables were coded for
some sets of papers, such as the median and mean loan size for microfinance programs. This
4 Three titles here may be misleading. "Mobile phone-based reminders" refers specifically to SMS or
voice reminders for health-related outcomes. “Women’s empowerment programs” required an educational
component to be included in the intervention and it could not be an unrelated intervention that merely disaggregated outcomes by gender. Finally, micronutrients were initially too loosely defined; this was narrowed
down to focus on those providing zinc to children, but the other micronutrient papers are still included in
the data, with a tag, as they may still be useful.
paper focuses on the variables held in common across the different topics. These include
which method was used; if randomized, whether it was randomized by cluster; whether it
was blinded; where it was (village, province, country - these were later geocoded in a separate process); what kind of institution carried out the implementation; characteristics of the
population; and the duration of the intervention from the baseline to the midline or endline
results, among others. A full set of variables and the coding manual is available online, as
detailed in Appendix A.
As this paper pays particular attention to the program implementer, it is worth discussing
how this variable was coded in more detail. There were several types of implementers that
could be coded: governments, NGOs, private sector firms, and academics. There was also a
code for “other” (primarily collaborations) or “unclear”. The vast majority of studies were
implemented by academic research teams and NGOs. This paper considers NGOs and academic research teams together because it turned out to be practically difficult to distinguish
between them in the studies, especially as the passive voice was frequently used (e.g. “X
was done” without noting who did it). There were only a few private sector firms involved,
so they are considered with the “other” category in this paper.
Studies tend to report results for multiple specifications. AidGrade focused on those
results least likely to have been influenced by author choices: those with the fewest controls, apart from fixed effects. Where a study reported results using different methodologies,
coders were instructed to collect the findings obtained under the authors’ preferred methodology; where the preferred methodology was unclear, coders were advised to follow the
internal preference ordering of prioritizing randomized controlled trials, followed by regression discontinuity designs and differences-in-differences, followed by matching, and to collect
multiple sets of results when they were unclear on which to include. Where results were
presented separately for multiple subgroups, coders were similarly advised to err on the side
of caution and to collect both the aggregate results and results by subgroup except where the
author appeared to be only including a subgroup because results were significant within that
subgroup. For example, if an author reported results for children aged 8-15 and then also
presented results for children aged 12-13, only the aggregate results would be recorded, but
if the author presented results for children aged 8-9, 10-11, 12-13, and 14-15, all subgroups
would be coded as well as the aggregate result when presented. Authors only rarely reported
isolated subgroups, so this was not a major issue in practice.
When considering the variation of effect sizes within a group of papers, the definition of
the group is clearly critical. Two different rules were initially used to define outcomes: a
strict rule, under which only identical outcome variables are considered alike, and a loose
rule, under which similar but distinct outcomes are grouped into clusters.
The precise coding rules were as follows:
1. We consider outcome A to be the same as outcome B under the “strict rule” if outcomes A and B measure the exact same quality. Different units may be used, pending
conversion. The outcomes may cover different timespans (e.g. encompassing both
outcomes over “the last month” and “the last week”). They may also cover different
populations (e.g. children or adults). Examples: height; attendance rates.
2. We consider outcome A to be the same as outcome B under the “loose rule” if they
do not meet the strict rule but are clearly related. Example: parasitemia greater than
4000/µl with fever and parasitemia greater than 2500/µl.
Clearly, even under the strict rule, differences between the studies may exist; however, using two different rules allows us to isolate the potential sources of variation, and other variables
were coded to capture some of this variation, such as the age of those in the sample. If one
were to divide the studies by these characteristics, however, the data would usually be too
sparse for analysis.
Interventions were also defined separately and coders were also asked to write a short
description of the details of each program. Program names were recorded so as to identify
those papers on the same program, such as the various evaluations of PROGRESA.
After coding, the data were then standardized to make results easier to interpret and so as not to overly weight those outcomes with larger scales. The typical way to compare results across different outcomes is by using the standardized mean difference, defined as:

SMD = \frac{\mu_1 - \mu_2}{\sigma_p}

where μ_1 is the mean outcome in the treatment group, μ_2 is the mean outcome in the control group, and σ_p is the pooled standard deviation. When data are not available to calculate the pooled standard deviation, it can be approximated by the standard deviation of the dependent variable for the entire distribution of observations or by the standard deviation in the control group (Glass, 1976). If that is not available either, due to standard deviations not having been reported in the original papers, one can use the typical standard deviation for the intervention-outcome. I follow this approach to calculate the standardized mean difference, which is then used as the effect size measure for the rest of the paper unless otherwise noted.
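As a small worked example of this standardization step (a sketch under the fallback rules just described; the function name and the example numbers are hypothetical):

def standardized_mean_difference(mean_t, mean_c, sd_pooled=None,
                                 sd_overall=None, sd_control=None,
                                 sd_typical=None):
    # Standardized mean difference with the fallback rules described in the
    # text: use the pooled SD when available, otherwise the SD of the
    # dependent variable over all observations, then the control-group SD,
    # and finally a typical SD for the intervention-outcome combination.
    for sd in (sd_pooled, sd_overall, sd_control, sd_typical):
        if sd is not None and sd > 0:
            return (mean_t - mean_c) / sd
    raise ValueError("No usable standard deviation supplied.")

# Example: treatment mean 0.55, control mean 0.50, pooled SD unavailable,
# control-group SD 0.25 -> SMD = 0.2
print(standardized_mean_difference(0.55, 0.50, sd_control=0.25))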
This paper uses the “strict” outcomes where available, but the “loose” outcomes where
that would keep more data. For papers which were follow-ups of the same study, the most
recent results were used for each outcome.
Finally, one paper appeared to misreport results, suggesting implausibly low values and
standard deviations for hemoglobin. These results were excluded and the paper’s corresponding author contacted. Excluding this paper’s results, effect sizes range between -1.5 and 1.8
SD, with an interquartile range of 0 to 0.2 SD. So as to mitigate sensitivity to individual
results, especially with the small number of papers in some intervention-outcome groups, I
restrict attention to those standardized effect sizes less than 2 SD away from 0. I report
main results including this one additional observation in the Appendix.
3.3 Data Description
Figure 2 summarizes the distribution of studies covering the interventions and outcomes
considered in this paper that can be standardized. Attention will typically be limited to
those intervention-outcome combinations on which we have data for at least three papers.
Table 12 in Appendix D lists the interventions and outcomes and describes their results in
a bit more detail, providing the distribution of significant and insignificant results. It should
be emphasized that the number of negative and significant, insignificant, and positive and
significant results per intervention-outcome combination only provide ambiguous evidence
of the typical efficacy of a particular type of intervention. Simply tallying the numbers in
each category is known as “vote counting” and can yield misleading results if, for example,
some studies are underpowered.
Table 2 further summarizes the distribution of papers across interventions and highlights
the fact that papers exhibit very little overlap in terms of outcomes studied. This is consistent
with the story of researchers each wanting to publish one of the first papers on a topic. Vivalt
(2015a) finds that later papers on the same intervention-outcome combination more often
remain as working papers.
A note must be made about combining data. When conducting a meta-analysis, the
Cochrane Handbook for Systematic Reviews of Interventions recommends collapsing the
data to one observation per intervention-outcome-paper, and I do this for generating the
within intervention-outcome meta-analyses (Higgins and Green, 2011). Where results had
been reported for multiple subgroups (e.g. women and men), I aggregated them as in the
Cochrane Handbook’s Table 7.7.a. Where results were reported for multiple time periods
(e.g. 6 months after the intervention and 12 months after the intervention), I used the most
comparable time periods across papers. When combining across multiple outcomes, which
has limited use but will come up later in the paper, I used the formulae from Borenstein et
al. (2009), Chapter 24.
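For concreteness, the subgroup aggregation step can be sketched as follows, using the combining-groups formula from the Cochrane Handbook's Table 7.7.a; the example values are hypothetical:

from math import sqrt

def combine_subgroups(n1, m1, sd1, n2, m2, sd2):
    # Combine two non-overlapping subgroups (e.g. women and men) into a
    # single group mean and SD, following the formula in the Cochrane
    # Handbook's Table 7.7.a.
    n = n1 + n2
    m = (n1 * m1 + n2 * m2) / n
    pooled_ss = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2
                 + n1 * n2 / (n1 + n2) * (m1 - m2) ** 2)
    sd = sqrt(pooled_ss / (n - 1))
    return n, m, sd

# Hypothetical example: girls (n=120, mean=0.55, sd=0.20) and
# boys (n=100, mean=0.49, sd=0.22).
print(combine_subgroups(120, 0.55, 0.20, 100, 0.49, 0.22))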
Figure 2: Within-Intervention-Outcome Number of Papers
Table 2: Descriptive Statistics: Distribution of Narrow Outcomes

Intervention                     Number of outcomes  Mean papers per outcome  Max papers per outcome
Conditional cash transfers       10                  21                       37
Contract teachers                1                   3                        3
Deworming                        12                  13                       18
Financial literacy               1                   5                        5
HIV/AIDS Education               3                   8                        10
Improved stoves                  4                   2                        2
Insecticide-treated bed nets     1                   9                        9
Irrigation                       2                   2                        2
Micro health insurance           1                   2                        2
Microfinance                     5                   4                        5
Micronutrient supplementation    23                  27                       47
Mobile phone-based reminders     2                   4                        5
Performance pay                  1                   3                        3
Rural electrification            3                   3                        3
Safe water storage               1                   2                        2
Scholarships                     3                   4                        5
School meals                     3                   3                        3
Unconditional cash transfers     3                   9                        11
Water treatment                  2                   5                        6
Women's empowerment programs     2                   2                        2

Average                          4.2                 6.5                      9.0
4 Method

4.1 Estimation of the Gains in Accuracy of θ̂_i
The model contains four parameters: μ, τ², σ_i², and θ_i. σ_i² is provided by the data; τ², μ and θ_i are estimated as described in Section 2.

These estimates are done within intervention-outcome combinations. To approximate the effects of a policymaker learning more information as more studies are completed, I use all the different combinations of studies, in turn, when generating the estimates. For example, if three studies, labelled 1, 2, 3, considered the effects of a particular intervention on a particular outcome, I would generate estimates of θ_i using each set of studies {1}, {2}, {3}, {1,2}, {1,3}, {2,3}, {1,2,3}. Each θ̂_i is then associated with a number of studies, n, that went into estimating it, denoted θ̂_{i,n}. I approximate θ_i (θ̂_{i,n} as n → ∞) with θ̂_{i,N}, the estimate from using the full data available for that intervention-outcome combination. Figure 3 shows how the mean θ̂_{i,n} evolves as more data are added for each intervention-outcome combination. In particular, the Y-axis represents the difference between the mean θ̂_{i,n} and mean θ̂_{i,N}, for various n. The convergence of mean θ̂_{i,n} to θ̂_{i,N} is mechanical due to how θ̂_{i,n} is constructed; however, the figure illustrates that convergence is often attained by n = 10 even when more studies exist. This suggests that using θ̂_{i,N} to estimate θ_i is particularly reasonable for N ≥ 10. I will use this number as a minimum cut-off for N in a robustness check.
Figure 3: Evolution of Mean θ̂_{i,n}
The Y-axis represents the difference between the mean θ̂_{i,n} and mean θ̂_{i,N}.
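The subset-based exercise described above can be organized as in the following sketch (my own illustration; estimate_fn stands in for whatever estimator is used, for example a posterior-mean summary of the random-effects sketch in Section 2.1.1). Note that the number of subsets grows as 2^N, so for large N one might sample subsets rather than enumerate them all.

from itertools import combinations
import numpy as np

def convergence_by_n(y, sigma2, estimate_fn):
    # For one intervention-outcome combination, estimate theta_i from every
    # non-empty subset of studies and average the estimates by subset size n,
    # mirroring the exercise plotted in Figure 3.
    # estimate_fn(y_sub, sigma2_sub) should return a scalar estimate.
    y, sigma2 = np.asarray(y, float), np.asarray(sigma2, float)
    N = len(y)
    mean_by_n = {}
    for n in range(1, N + 1):
        ests = [estimate_fn(y[list(s)], sigma2[list(s)])
                for s in combinations(range(N), n)]
        mean_by_n[n] = float(np.mean(ests))
    # Difference between the mean estimate at each n and the full-data estimate.
    return {n: mean_by_n[n] - mean_by_n[N] for n in mean_by_n}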
4.2 Generalizability Measures
It would be helpful to have some rules of thumb that can be used to gauge the expected
external validity of a particular study’s results or the expected likelihood that an impact
evaluation will be worthwhile. I will refer throughout to several metrics that could be used
as measures of generalizability. I can then relate these measures to results from the model.
Since generalizability can be thought of as the ability to predict results accurately in an out-of-sample group, and the ability to predict results, in turn, hinges both on 1) the variability of the results and 2) the proportion that can be explained, we will want measures that speak to each of these. In particular, I will focus on the variance (var(Y_i)) and coefficient of variation (sd(Y_i)/Ȳ_i) to capture variability and the I² (τ²/(τ² + σ²)) as a measure of the proportion of variation that is systematic.5 I will also separate out the sampling variance and use explanatory variables to reduce the unexplained heterogeneity, resulting in the amount of residual variation in Y_i, var_R(Y_i), the coefficient of residual variation, CV_R(Y_i), and the residual I²_R. Appendix C has more information on these measures and motivates their use. It is important to note that each measure captures different things and has advantages and disadvantages, as summarized in Tables 10 and 11 in that section. Multiple measures should be simultaneously used to get a full picture.
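A minimal sketch of how these measures can be computed for one intervention-outcome combination (assuming effect sizes Y_i and sampling variances σ_i²; the I² below is the standard Higgins-Thompson version based on Cochran's Q, which differs slightly from the Bayesian τ² used elsewhere in the paper):

import numpy as np

def heterogeneity_measures(y, sigma2):
    # Simple versions of the three measures discussed in the text: the
    # variance of effect sizes, the coefficient of variation (reported in
    # absolute value, following the paper's convention), and the
    # Higgins-Thompson I^2 computed from Cochran's Q statistic.
    y = np.asarray(y, dtype=float)
    sigma2 = np.asarray(sigma2, dtype=float)
    w = 1.0 / sigma2
    mu_fixed = np.sum(w * y) / np.sum(w)
    Q = np.sum(w * (y - mu_fixed) ** 2)
    i2 = max(0.0, (Q - (len(y) - 1)) / Q) if Q > 0 else 0.0
    return {
        "var(Y_i)": float(y.var(ddof=1)),
        "CV(Y_i)": float(y.std(ddof=1) / abs(y.mean())),
        "I^2": i2,
    }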
5 Results
In this section, I will first present results without modelling any of the heterogeneity in outcomes. We will see that the values for all the measures of heterogeneity are quite high, and an additional impact evaluation would typically improve estimates of θ_i by only 0.002 - 0.015 standard deviations. Given an average effect size of 0.12, this is approximately 1.6% - 12.3%.
5.1 Without Modelling Heterogeneity
Table 3 presents results within intervention-outcome combinations. All Y_i were converted to be in terms of standard deviations to put them on a common scale before statistics were calculated, with the aforementioned caveats. ρ_{A,B} represents the average change in the difference between θ̂_i and θ_i when moving from using data from A studies to using data from B studies to estimate θ_i. If we restrict attention to those intervention-outcome combinations with N ≥ 10, for which θ̂_{i,N} may better approximate θ_i, the average amount an additional impact evaluation would improve estimates of θ_i decreases to 0.0002 - 0.007, or 0.1 - 5.7%. The marginal improvements are higher for smaller n, as we might expect.

5 This paper follows convention and reports the absolute value of the coefficient of variation wherever it appears.

Figure 4: Dispersion of Estimates
Various θ̂_{i,n} estimates for τ² and σ_i² in the data. Outliers are dropped for easier viewing.
Figure 4 plots the distribution of θ̂_i generated for each subset of the data within an intervention-outcome combination. Each intervention-outcome combination is associated with a unique τ² and σ_i² in this plot: the τ² and σ_i² that were estimated using all the data available for that intervention-outcome combination. Each vertical line of dots thus represents a range of different estimates of θ̂_i. If θ̂_i > θ* but θ_i < θ*, or vice versa, the policymaker is making a mistake. θ* is set at 0.1 in the diagram for illustration. Clearly, with a lot of dispersion around 0.1, the chances of making a mistake are high. The likelihood of making a mistake would be lower for particularly high or low θ*, but additional impact evaluations do not help to reduce this likelihood by much.
The different heterogeneity measures, var(Y_i), CV(Y_i) and I², yield varied results. As previously discussed, they measure different things. The coefficient of variation depends heavily on the mean; the I², on the precision of the underlying estimates. The intervention-outcome combinations that fall within the bottom third by variance have var(Y_i) ≤ 0.015; the top third have var(Y_i) ≥ 0.052. Similarly, the threshold delineating the bottom third for the coefficient of variation is 1.14 and, for the top third, 2.36. For I², the lower threshold is 0.93 and the upper threshold approaches 1.
Table 3: Heterogeneity Measures for Effect Sizes Within Intervention-Outcomes

Intervention  Outcome  ρ1,2  ρ5,6  ρ10,11  var(Y_i)  CV(Y_i)  I²  N
Microfinance  Assets  0.00  .  .  0.000  5.51  1.00  4
Rural Electrification  Enrollment rate  0.01  .  .  0.001  0.13  0.93  3
Micronutrients  Cough prevalence  0.01  .  .  0.001  1.65  1.00  3
Microfinance  Total income  0.00  .  .  0.001  0.99  1.00  5
Microfinance  Savings  0.00  .  .  0.002  1.77  1.00  3
Financial Literacy  Savings  0.01  .  .  0.004  5.47  0.99  5
Microfinance  Profits  0.01  .  .  0.005  5.45  1.00  5
Contract Teachers  Test scores  0.01  .  .  0.005  0.40  1.00  3
Performance Pay  Test scores  0.01  .  .  0.006  0.61  1.00  3
Micronutrients  Body mass index  0.00  .  .  0.007  0.67  1.00  5
Conditional Cash Transfers  Unpaid labor  0.01  .  .  0.009  0.92  0.93  5
Micronutrients  Weight-for-age  0.01  0.00  0.00  0.009  1.94  0.98  34
Micronutrients  Weight-for-height  0.01  0.01  0.00  0.010  2.15  0.80  26
Micronutrients  Birthweight  0.01  0.00  .  0.010  0.98  0.94  7
Micronutrients  Height-for-age  0.01  0.01  0.00  0.012  2.47  1.00  36
Conditional Cash Transfers  Test scores  0.01  .  .  0.013  1.87  1.00  5
Deworming  Hemoglobin  0.01  0.01  0.02  0.015  3.38  1.00  15
Micronutrients  Mid-upper arm circumference  0.01  0.00  0.01  0.015  2.08  0.53  18
Conditional Cash Transfers  Enrollment rate  0.00  0.00  0.00  0.015  0.83  1.00  37
Unconditional Cash Transfers  Enrollment rate  0.01  0.00  0.00  0.016  1.09  1.00  11
Water Treatment  Diarrhea prevalence  0.01  0.00  .  0.020  0.97  1.00  6
SMS Reminders  Treatment adherence  0.01  .  .  0.022  1.67  0.93  5
Conditional Cash Transfers  Labor force participation  0.00  0.00  0.01  0.023  1.63  0.33  17
School Meals  Test scores  0.05  .  .  0.023  1.29  0.58  3
Micronutrients  Height  -0.01  0.00  0.00  0.023  4.37  0.98  32
Micronutrients  Mortality rate  0.03  0.00  0.00  0.025  2.88  0.07  12
Micronutrients  Stunted  0.02  .  .  0.025  1.11  0.12  5
Bed Nets  Malaria  0.02  0.00  .  0.029  0.50  0.97  9
Conditional Cash Transfers  Attendance rate  0.03  0.00  0.00  0.030  0.52  1.00  15
Micronutrients  Weight  0.00  0.00  0.00  0.034  2.70  0.73  36
HIV/AIDS Education  Used contraceptives  0.02  0.01  .  0.036  3.12  0.51  10
Micronutrients  Perinatal deaths  0.02  0.02  .  0.038  2.10  0.05  6
Deworming  Height  0.02  -0.01  -0.01  0.049  2.36  1.00  17
Micronutrients  Test scores  0.00  0.00  .  0.052  1.69  1.00  10
Scholarships  Enrollment rate  0.01  .  .  0.053  0.69  1.00  5
Conditional Cash Transfers  Height-for-age  0.04  0.01  .  0.055  22.17  0.03  7
Deworming  Weight-for-height  0.00  -0.01  0.00  0.072  3.13  1.00  11
Micronutrients  Stillbirths  0.03  .  .  0.075  3.04  0.01  4
School Meals  Enrollment rate  0.11  .  .  0.081  1.14  0.01  3
Micronutrients  Prevalence of anemia  0.02  0.00  0.00  0.095  0.79  0.90  15
Deworming  Height-for-age  0.03  0.00  -0.03  0.098  1.98  1.00  14
Deworming  Weight-for-age  -0.01  0.00  0.00  0.107  2.29  1.00  12
Micronutrients  Diarrhea incidence  -0.04  -0.01  0.03  0.109  3.30  1.00  11
Micronutrients  Diarrhea prevalence  0.03  0.02  .  0.111  1.21  0.98  6
Micronutrients  Fever prevalence  0.02  .  .  0.146  3.08  0.76  5
Deworming  Weight  0.02  0.00  0.02  0.184  4.76  1.00  18
Micronutrients  Hemoglobin  -0.01  -0.01  0.00  0.215  1.44  1.00  46
SMS Reminders  Appointment attendance rate  0.01  .  .  0.224  2.91  0.99  3
Deworming  Mid-upper arm circumference  0.00  0.03  .  0.439  1.77  1.00  7
Conditional Cash Transfers  Probability unpaid work  0.02  .  .  0.609  6.42  0.97  5
Rural Electrification  Study time  0.09  .  .  0.997  1.10  0.03  3

Average  .  0.015  0.002  0.003  0.083  2.52  0.80  12
How should we interpret these numbers? Higgins and Thompson, who defined I², called 0.25 indicative of "low", 0.5 "moderate", and 0.75 "high" levels of heterogeneity (2002; Higgins et al., 2003). Figure 5 plots a histogram of the I² results, showing a lot of systematic variation according to this scale. No similar benchmarks have been defined for the variance, but studies in the medical literature tend to exhibit a coefficient of variation of approximately 0.05-0.5 (Tian, 2005; Ng, 2014). By this standard, too, results would appear quite heterogeneous.
Figure 5: Density of I² values
An alternative benchmark that might have more appeal is that of the average within-study variation. If the across-study variation approached the within-study variation, we
might not be so concerned about generalizability.
Table 13 in Appendix D illustrates the gap between the across-study and mean within-study variance, coefficient of variation, and I² for those intervention-outcomes for which we have enough data to calculate the within-study measures. Not all studies report multiple results for the intervention-outcome combination in question. A paper might report multiple results for a particular intervention-outcome combination if, for example, it were reporting results for different subgroups, such as for different age groups, genders, or geographic areas. The median within-paper variance for those papers for which this can be generated is 0.027, while it is 0.037 across papers; similarly, the median within-paper coefficient of variation is 0.91, compared to 1.48 across papers. If we were to form the I² for each paper separately, the median within-paper value would be 0.64, as opposed to 1 across papers. Figure 6 presents the distributions graphically; to increase the sample size, this figure includes results even when there are only two papers within an intervention-outcome combination or two results reported within a paper.

Figure 6: Distribution of within and across-paper heterogeneity measures
As there are a few outliers in the right tail, variances above 0.25 and coefficients of variation above 10 are dropped from these figures (dropping 5 observations).
Finally, we can try to derive benchmarks more directly, based on the expected prediction
error. Again, it is immediately apparent that what counts as large or small error depends
on the policy question - the outside option θ*. In some cases, it might not matter if an
effect size were mis-predicted by 25%. In others, a prediction error of this magnitude could
mean the difference between choosing one program over another or whether a program is
worthwhile to pursue at all.
Still, if we take the mean effect size within an intervention-outcome to be our “best
guess” of how a program will perform and, as an illustrative example, want the prediction
error to be less than 25% at least 50% of the time, this would imply a certain cut-off
threshold for the variance if we assume that results are normally distributed. Note that the assumptions that results are drawn from the same normal distribution and that the mean and variance of this distribution can be approximated by the mean and variance of observed results are a simplification for the purpose of a back-of-the-envelope calculation.
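The implied variance cut-off can be computed directly under this simplification. The sketch below (illustrative only; the mean effect size in the example is hypothetical, and the paper's var25 and var50 columns are the author's calculations rather than output of this code) finds the largest variance for which a normal draw stays within a given relative error of the mean at least a given share of the time:

from statistics import NormalDist

def variance_threshold(mean_effect, rel_error=0.25, coverage=0.5):
    # Largest var(Y_i) such that a draw from N(mean_effect, var) falls within
    # rel_error * |mean_effect| of the mean with probability >= coverage,
    # following the back-of-the-envelope normality assumption in the text.
    z = NormalDist().inv_cdf(0.5 + coverage / 2)   # ~0.6745 when coverage = 0.5
    return (rel_error * abs(mean_effect) / z) ** 2

# Purely illustrative mean effect size of 0.15 SD (not a value from the paper):
print(variance_threshold(0.15, rel_error=0.25))  # 25% error tolerance
print(variance_threshold(0.15, rel_error=0.50))  # 50% error tolerance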
Table 4 summarizes the implied bounds for var(Yi ) for the prediction error to be less
than 25% and 50%, respectively, at least 50% of the time, alongside the actual variance
in results within each intervention-outcome. In only 1 of 51 cases is the true variance in
results smaller than the variance implied by the 25% prediction error cut-off threshold, and
in 9 other cases it is below the 50% prediction error threshold. In other words, for more
than 80% of intervention-outcomes, the implied prediction error is greater than 50% more
than 50% of the time.
Table 4: Actual Variance vs. Variance for Prediction Error Thresholds

Intervention  Outcome  Ȳ_i  var(Y_i)  var25  var50
Microfinance  Assets  0.003  0.000  0.000  0.000
Rural Electrification  Enrollment rate  0.176  0.001  0.005  0.027
Micronutrients  Cough prevalence  -0.016  0.001  0.000  0.000
Microfinance  Total income  0.029  0.001  0.000  0.001
Microfinance  Savings  0.027  0.002  0.000  0.001
Financial Literacy  Savings  -0.012  0.004  0.000  0.000
Microfinance  Profits  -0.013  0.005  0.000  0.000
Contract Teachers  Test scores  0.182  0.005  0.005  0.029
Performance Pay  Test scores  0.131  0.006  0.003  0.015
Micronutrients  Body mass index  0.125  0.007  0.002  0.014
Conditional Cash Transfers  Unpaid labor  0.103  0.009  0.002  0.009
Micronutrients  Weight-for-age  0.050  0.009  0.000  0.002
Micronutrients  Weight-for-height  0.045  0.010  0.000  0.002
Micronutrients  Birthweight  0.102  0.010  0.002  0.009
Micronutrients  Height-for-age  0.044  0.012  0.000  0.002
Conditional Cash Transfers  Test scores  0.062  0.013  0.001  0.003
Deworming  Hemoglobin  0.036  0.015  0.000  0.001
Micronutrients  Mid-upper arm circumference  0.058  0.015  0.001  0.003
Conditional Cash Transfers  Enrollment rate  0.150  0.015  0.003  0.019
Unconditional Cash Transfers  Enrollment rate  0.115  0.016  0.002  0.011
Water Treatment  Diarrhea prevalence  0.145  0.020  0.003  0.018
SMS Reminders  Treatment adherence  0.088  0.022  0.001  0.007
Conditional Cash Transfers  Labor force participation  0.092  0.023  0.001  0.007
School Meals  Test scores  0.117  0.023  0.002  0.012
Micronutrients  Height  0.035  0.023  0.000  0.001
Micronutrients  Mortality rate  -0.054  0.025  0.000  0.003
Micronutrients  Stunted  0.143  0.025  0.003  0.018
Bed Nets  Malaria  0.342  0.029  0.018  0.101
Conditional Cash Transfers  Attendance rate  0.333  0.030  0.017  0.096
Micronutrients  Weight  0.068  0.034  0.001  0.004
HIV/AIDS Education  Used contraceptives  0.061  0.036  0.001  0.003
Micronutrients  Perinatal deaths  -0.093  0.038  0.001  0.008
Deworming  Height  0.094  0.049  0.001  0.008
Micronutrients  Test scores  0.134  0.052  0.003  0.016
Scholarships  Enrollment rate  0.336  0.053  0.017  0.098
Conditional Cash Transfers  Height-for-age  -0.011  0.055  0.000  0.000
Deworming  Weight-for-height  0.086  0.072  0.001  0.006
Micronutrients  Stillbirths  -0.090  0.075  0.001  0.007
School Meals  Enrollment rate  0.250  0.081  0.009  0.054
Micronutrients  Prevalence of anemia  0.389  0.095  0.023  0.131
Deworming  Height-for-age  0.159  0.098  0.004  0.022
Deworming  Weight-for-age  0.143  0.107  0.003  0.018
Micronutrients  Diarrhea incidence  0.100  0.109  0.002  0.009
Micronutrients  Diarrhea prevalence  0.277  0.111  0.012  0.066
Micronutrients  Fever prevalence  0.124  0.146  0.002  0.013
Deworming  Weight  0.090  0.184  0.001  0.007
Micronutrients  Hemoglobin  0.322  0.215  0.016  0.090
SMS Reminders  Appointment attendance rate  0.163  0.224  0.004  0.023
Deworming  Mid-upper arm circumference  0.373  0.439  0.021  0.121
Conditional Cash Transfers  Probability unpaid work  -0.122  0.609  0.002  0.013
Rural Electrification  Study time  0.906  0.997  0.125  0.710

var25 represents the variance that would result in a 25% prediction error for draws from a normal distribution centered at Ȳ_i. var50 represents the variance that would result in a 50% prediction error.
5.1.1 Predicting External Validity from a Single Paper
It would be very helpful if we could estimate the across-paper within-intervention-outcome metrics using the results from individual papers. Many papers report results for different subgroups or over time, and the variation in results for a particular intervention-outcome within a single paper could be a plausible proxy of variation in results for that same
intervention-outcome across papers. If this relationship holds, it would help researchers estimate the external validity of their own study, even when no other studies on the intervention
have been completed. Table 5 shows the results of regressing the across-paper measures of
var(Yi ) and CV(Yi ) on the average within-paper measures for the same intervention-outcome
combination.
It appears that within-paper variation in results is indeed significantly correlated with
across-paper variation in results. Authors could undoubtedly obtain even better estimates
using micro data.
5.1.2 Robustness Checks
One may be concerned that low-quality papers are either inflating or depressing the degree
of generalizability that is observed. There are many ways to measure paper “quality”; I
consider two. First, I use the most widely-used quality assessment measure, the Jadad scale
(Jadad et al., 1996). The Jadad scale asks whether the study was randomized, double-blind, and whether there was a description of withdrawals and dropouts. A paper gets one
point for having each of these characteristics; in addition, a point is added if the method
of randomization was appropriate, subtracted if the method is inappropriate, and similarly
added if the blinding method was appropriate and subtracted if inappropriate. This results in
a 0-5 point scale. Given that the kinds of interventions being tested are not typically readily
suited to blinding, I consider all those papers scoring at least a 3 to be “high quality”.
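A sketch of this scoring rule, encoded as a small function (my own rendering of the Jadad scale as described here, for illustration only, not an official implementation):

def jadad_score(randomized, randomization_appropriate,
                double_blind, blinding_appropriate,
                withdrawals_described):
    # One point each for randomization, double-blinding, and describing
    # withdrawals/dropouts; plus or minus one point for an appropriate or
    # inappropriate randomization method, and likewise for blinding.
    score = int(randomized) + int(double_blind) + int(withdrawals_described)
    if randomized and randomization_appropriate is not None:
        score += 1 if randomization_appropriate else -1
    if double_blind and blinding_appropriate is not None:
        score += 1 if blinding_appropriate else -1
    return max(score, 0)

# An unblinded randomized trial with an appropriate randomization method that
# reports attrition would score 3, "high quality" by the rule in the text.
print(jadad_score(True, True, False, None, True))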
In an alternative specification, I also consider only those results from studies that were
Table 5: Regression of Mean Within-Paper Heterogeneity on Across-Paper Heterogeneity
(1)
Across-paper variance
b/se
Mean within-paper variance
(2)
Across-paper CV
b/se
0.343**
(0.13)
Mean within-paper CV
-0.000
(0.00)
Mean within-paper I 2
Constant
Observations
R2
(3)
Across-paper I 2
b/se
0.101*
(0.06)
0.794
(0.67)
0.498***
(0.12)
0.507***
(0.10)
51
0.04
48
0.00
51
0.26
The mean of each within-paper measure is created by calculating the measure within a paper, for each
paper reporting two or more results on the same intervention-outcome combination, and then averaging
that measure across papers within the intervention-outcome.
RCTs. This is for two reasons. First, many would consider RCTs to be higher-quality
studies. We might also be concerned about how specification searching and publication bias
could affect results. In a separate paper (Vivalt, 2015a), I discuss these issues at length and
find relatively little evidence of these biases in the data, with RCTs exhibiting even fewer
signs of specification searching and publication bias. The results based on only those studies
which were RCTs thus provide a good robustness check.
Tables 14 and 15 in the Appendix provide robustness checks using the data that meet
these two quality criteria. Table 16 also includes the one observation previously dropped
for having an effect size more than 2 SD away from 0. The heterogeneity measures are not
substantially different using these data sets.
5.2 With Modelling Heterogeneity

5.2.1 Modelling Heterogeneity Across Intervention-Outcomes
If the heterogeneity in outcomes that has been observed can be systematically modelled,
it would improve our ability to make predictions. Do results exhibit any variation that is
systematic? To begin, I first present some OLS results, looking across different intervention-outcome combinations, to examine whether effect sizes are associated with any characteristics
of the program, study, or sample, pooling data from different intervention-outcomes.
As Table 6 indicates, there is some evidence that studies with a smaller number of
observations have greater effect sizes than studies based on a larger number of observations.
This is what we would expect if specification searching were easier in small datasets; this
pattern of results would also be what we would expect if power calculations drove researchers
to only proceed with studies with small sample sizes if they believed the program would result
in a large effect size or if larger studies are less well-targeted. Interestingly, government-implemented programs fare worse even controlling for sample size (the dummy variable
category left out is “Other-implemented”, which mainly consists of collaborations and private
sector-implemented interventions). Studies in the Middle East / North Africa region may
appear to do slightly better than those in Sub-Saharan Africa (the excluded region category),
but not much weight should be put on this as very few studies were conducted in the former
region.
While these regressions have the advantage of allowing me to draw on a larger sample
of studies and we might think that any patterns observed across so many interventions and
outcomes are fairly robust, we might be able to explain more variation if we restrict attention
to a particular intervention-outcome combination. I therefore focus on the case of conditional
cash transfers (CCTs) and enrollment rates, as this is the intervention-outcome combination
that contains the largest number of papers.
5.2.2 Within an Intervention-Outcome: The Case of CCTs and Enrollment Rates
The previous results used the across-intervention-outcome data, which were aggregated
to one result per intervention-outcome-paper. However, we might think that more variation
could be explained by carefully modelling results within a particular intervention-outcome
combination. This section provides an example, using the case of conditional cash transfers
and enrollment rates, the intervention-outcome combination covered by the most papers.
Suppose we were to try to explain as much variability in outcomes as possible, using
sample characteristics. The available variables which might plausibly have a relationship to
effect size are: the baseline enrollment rates6 ; the sample size; whether the study was done
in a rural or urban setting, or both; results for other programs in the same region7 ; and the
age and gender of the sample under consideration.
6 In some cases, only endline enrollment rates are reported. This variable is therefore constructed by using baseline rates for both the treatment and control group where they are available, followed by, in turn, the baseline rate for the control group; the baseline rate for the treatment group; the endline rate for the control group; the endline rate for the treatment and control group; and the endline rate for the treatment group.
7 Regions include: Latin America, Africa, the Middle East and North Africa, East Asia, and South Asia,
following the World Bank’s geographical divisions.
Table 6: Regression of Effect Size on Study Characteristics
(1)
Effect size
Number of
observations (100,000s)
Government-implemented
(2)
Effect size
(3)
Effect size
-0.011**
(0.00)
RCT
-0.012***
(0.00)
-0.009*
(0.00)
-0.087**
(0.04)
-0.057
(0.05)
0.038
(0.03)
East Asia
0.120***
(0.00)
0.180***
(0.03)
0.091***
(0.02)
-0.003
(0.03)
0.012
(0.04)
0.275**
(0.11)
0.021
(0.04)
0.105***
(0.02)
556
0.20
656
0.23
656
0.22
556
0.23
Latin America
Middle East/North
Africa
South Asia
Observations
R2
(5)
Effect size
-0.107***
(0.04)
-0.055
(0.04)
Academic/NGO-implemented
Constant
(4)
Effect size
0.177***
(0.03)
556
0.20
Standard errors are clustered by intervention-outcome. Different columns contain different numbers of
observations because not all studies reported the number of observations on which their estimate was based.
Table 7 shows the results of OLS regressions of the effect size on these variables, in turn.
The baseline enrollment rates show the strongest relationship to effect size, as reflected in
the R2 and significance levels: it is easier to have large gains where initial rates are low.
Some papers pay particular attention to those children that were not enrolled at baseline
or that were enrolled at baseline. These are coded as a “0%” or “100%” enrollment rate
at baseline but are also represented by two dummy variables (Column 2). Larger studies
and studies done in urban areas also tend to find smaller effect sizes than smaller studies
or studies done in rural or mixed urban/rural areas. Finally, for each result I calculate the
mean result in the same region, excluding results from the program in question. Results do
appear slightly correlated across different programs in the same region.
Table 7: Regression of Projects’ Effect Sizes on Characteristics (CCTs on Enrollment Rates)
Enrollment Rates
(1)
ES
(2)
ES
-0.224***
(0.05)
-0.092
(0.06)
-0.002
(0.02)
0.183***
(0.05)
Enrolled at Baseline
Not Enrolled at
Baseline
Number of
Observations (100,000s)
Rural
(3)
ES
(4)
ES
(5)
ES
(6)
ES
(7)
ES
(8)
ES
(10)
ES
-0.127***
(0.02)
0.142***
(0.03)
-0.002
(0.00)
0.002
(0.02)
-0.039**
(0.02)
-0.011*
(0.01)
0.049**
(0.02)
Urban
-0.068***
(0.02)
Girls
-0.002
(0.03)
Boys
-0.019
(0.02)
Minimum Sample Age
0.005
(0.01)
Mean Regional Result
Observations
R2
(9)
ES
112
0.41
112
0.52
108
0.01
130
0.06
130
0.05
130
0.00
130
0.01
104
0.02
1.000**
(0.38)
0.714**
(0.28)
130
0.01
92
0.58
Each column reports the results of regressing the effect size (ES) on different explanatory variables. Multiple results for different subgroups may be
reported for the same paper; the data on which this table is based includes multiple results from the same paper for different subgroups that are
non-overlapping (e.g. boys and girls, groups with different age ranges, or different geographical areas). Standard errors are clustered by paper. Not
every paper reports every explanatory variable, so different columns are based on different numbers of observations.
Table 8: Impact of Mixed Models on Measures

Model                   var(Y_i)  var_R(Y_i - Ŷ_i)  CV(Y_i)  CV_R(Y_i - Ŷ_i)  I²    I²_R  N
Random-effects model    0.011     0.011             1.24     1.24             0.97  0.97  122
Mixed model 1           0.011     0.007             1.28     1.04             0.97  0.96  104
Mixed model 2           0.012     0.005             1.25     0.85             0.96  0.93  87
As baseline enrollment rates have the strongest relationship to effect size, I use this as an explanatory variable in a hierarchical mixed model (the specification of Column 1) to explore how it affects the residual varR(Yi − Ŷi), CVR(Yi − Ŷi) and IR². I also use the specification in Column 10 of Table 7 as a robustness check. The results are reported in Table 8 for each of these two mixed models, alongside the values from the random-effects model that does not use any explanatory variables.

Not all papers provide information for each explanatory variable, and each row is based on only those studies which could be used to estimate the model. Thus the values of var(Yi), CV(Yi) and I², which do not depend on the model used, may still vary between rows.

In the random-effects model, since no explanatory variables are used, Ŷi is simply the mean, and varR(Yi − Ŷi), CVR(Yi − Ŷi) and IR² offer no improvement on var(Yi), CV(Yi) and I². As more explanatory variables are added, the gap between var(Yi) and varR(Yi − Ŷi), between CV(Yi) and CVR(Yi − Ŷi), and between I² and IR² grows. In all cases, including explanatory variables helps reduce the unexplained variation, to varying degrees. varR(Yi − Ŷi) and CVR(Yi − Ŷi) are greatly reduced relative to var(Yi) and CV(Yi), but IR² is not much lower than I². This is likely due to a feature of I² (IR²) discussed previously: it depends on the precision of estimates. With evaluations of CCT programs tending to have large sample sizes, the value of I² (IR²) is higher than it otherwise would be.
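As a rough illustration of how residual measures like those in Table 8 can be computed once a model has been fit, consider the following Python sketch. It is not the estimation code used in the paper: the covariate, standard errors and effect sizes are simulated, a simple linear fit stands in for the hierarchical mixed model, and the residual I² uses a crude moment-based approximation based on the average sampling variance.

import numpy as np

def heterogeneity_stats(Y, se, Y_hat):
    resid = Y - Y_hat
    var_Y = Y.var(ddof=1)                        # var(Yi)
    var_R = resid.var(ddof=1)                    # varR(Yi - Yhat_i)
    cv_Y = Y.std(ddof=1) / Y.mean()              # CV(Yi)
    cv_R = resid.std(ddof=1) / Y.mean()          # CVR, scaling residual spread by the mean result
    tau2 = max(var_R - np.mean(se ** 2), 0.0)    # crude estimate of residual between-study variance
    I2_R = tau2 / (tau2 + np.mean(se ** 2))      # simplified residual I-squared
    return var_Y, var_R, cv_Y, cv_R, I2_R

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 50)                        # e.g. baseline enrollment rate
se = rng.uniform(0.02, 0.10, 50)                 # standard error of each estimate
Y = 0.2 - 0.15 * x + rng.normal(0, 0.05, 50) + rng.normal(0, se)
Y_hat = np.polyval(np.polyfit(x, Y, 1), x)       # stand-in for the mixed-model fitted values
print(heterogeneity_stats(Y, se, Y_hat))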
5.2.3 Removing Sampling Variance
Finally, it may be worthwhile to point out that sampling variance can artificially inflate
the variance of studies’ results. By how much? I examine this question using the case of
deworming programs.
Here, we know that many factors may affect the results a study obtains. For example,
whether the study was randomized by cluster or by individual would have ramifications for
the overall burden of disease and likelihood of infection in a particular area (Miguel and
Kremer, 2004). Apart from this, initial burdens of disease may vary from place to place,
and the exact dosage schedule was also different between different studies. Despite these
and other differences, it is remarkable how much of the variation in treatment effects can be
attributed to sampling variance. Table 9 presents the variance of Yi and the portion that is not attributed to sampling variance, τ². It then shows the difference between the two and the proportion of var(Yi) that can be attributed to sampling variance: 46%, on average. It then presents similar statistics for the coefficient of variation, generating a measure, CVτ, which is the coefficient of variation if τ were used in place of the standard deviation of Yi. An average of 29% of CV(Yi) can be explained simply by sampling variance.
Clearly, considering sampling variance can reduce the observed heterogeneity measures.
Further, it does not even require the potentially time-consuming and expensive collection of
data on additional explanatory variables.
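The arithmetic behind Table 9 is mechanical once var(Yi), τ² and the mean effect are in hand. The short Python sketch below illustrates it using the height-for-age row; the mean is backed out from the reported CV(Yi), and the sketch is purely illustrative rather than the paper's code.

import numpy as np

var_Y = 0.098                        # variance of observed effects (height-for-age)
tau2 = 0.073                         # variance remaining after removing sampling variance
mean_Y = np.sqrt(var_Y) / 1.98       # mean effect implied by CV(Yi) = sd(Yi)/mean = 1.98

share_sampling = (var_Y - tau2) / var_Y      # share of var(Yi) attributable to sampling variance (~0.26)
cv_Y = np.sqrt(var_Y) / mean_Y               # CV(Yi) (= 1.98 by construction)
cv_tau = np.sqrt(tau2) / mean_Y              # CVtau: CV with tau in place of sd(Yi) (~1.7)
print(share_sampling, cv_Y, cv_tau)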
Table 9: Deworming Case Study: The Difference Made by Adjusting for Sampling Error

Outcome                        var(Yi)    τ²     Difference    Percent difference   CV(Yi)   CVτ    Difference   Percent difference
                                                 in variance   in variance                          in CV        in CV
Height                         0.049    0.044      0.005           0.10              2.36    2.24      0.12          0.05
Height-for-age                 0.098    0.073      0.026           0.26              1.98    1.70      0.28          0.14
Hemoglobin                     0.015    0.003      0.012           0.83              3.38    1.40      1.98          0.59
Mid-upper arm circumference    0.439    0.083      0.356           0.81              1.77    0.77      1.00          0.57
Weight                         0.184    0.082      0.102           0.56              4.76    3.17      1.59          0.33
Weight-for-age                 0.107    0.072      0.036           0.33              2.29    1.87      0.42          0.18
Weight-for-height              0.072    0.047      0.026           0.35              3.13    2.52      0.61          0.20
6 Discussion
Why should we care about heterogeneity in treatment effects?
Impact evaluations are used to inform policy decisions. We need to know how best to
extrapolate from them in order to make the most effective policy choices. There has been
a great controversy over external validity, and this paper brings data from many different
types of interventions to bear on the question.
We saw that whether we looked at benchmarks from other literatures (CVs typically falling between 0.05 and 0.5; I² being considered “high” at 0.75) or whether we constructed additional benchmarks based on the amount of variation that would result in substantial prediction errors, the degree of heterogeneity between different studies’ results was quite high.
Another statistic that supports this conclusion is the absolute value of the prediction error, |Yi − Ŷi|, at time t + 1, using all the data up to time t to predict Yi. If we use the mean Yi up to time t in each intervention-outcome as the simplest, naive predictor, the average absolute value of the error is 0.18, compared to an average effect size of 0.12. The median absolute value of the error is 0.10, or, in percent terms, the median amount by which the prediction is off is 97%. It would seem difficult to plan policy with the typical error approaching 100%.8
8 Imagine: “This policy will increase A by B. Or maybe it will do nothing at all. It’s a toss-up.”
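The following Python sketch illustrates the naive prediction exercise with made-up numbers: within each intervention-outcome, each result is predicted by the mean of all chronologically earlier results, and the absolute and percent errors of those predictions are then summarized. It is an illustration of the procedure, not the paper's code, and the percent-error convention shown (relative to the realized effect) is only one of several that could be used.

import pandas as pd

df = pd.DataFrame({
    "intervention_outcome": ["CCT-enrollment"] * 4 + ["deworming-weight"] * 3,
    "year":                 [2004, 2006, 2009, 2012, 2003, 2007, 2011],
    "effect_size":          [0.15, 0.05, 0.30, 0.10, 0.20, 0.02, 0.40],
}).sort_values(["intervention_outcome", "year"])

def naive_errors(group):
    # Prediction for each result: mean of all earlier results in the same intervention-outcome.
    pred = group["effect_size"].expanding().mean().shift(1)
    return (group["effect_size"] - pred).abs().dropna()

errors = df.groupby("intervention_outcome", group_keys=False).apply(naive_errors)
pct_errors = errors / df.loc[errors.index, "effect_size"].abs()
print(errors.median(), pct_errors.median())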
Whether this is “enough” evidence on which to base a policy decision of whether or not
to adopt an intervention in a new setting depends on the alternative policies that could
be adopted and the amount of information that could be gained by an additional impact
evaluation.
This paper focused on both the marginal improvements in the predicted expected value
of a program as well as on the range of results a program is likely to have. The benefits of
improving estimates of θi are obvious, but the distribution of results might also be critically
important.
First, different programs exhibit different levels of heterogeneity, and policymakers may
be more risk-averse or risk-loving. There are several reasons one might be risk-averse. Practically speaking, it is possible that if one does not obtain a good result, one will not get
another shot. If a program fails, the people that were targeted may not be interested in
taking up a second attempt. If it is a government program, the policymaker could also be
voted out of office. Policymakers might also be risk-averse if they would like to maximize
beneficiaries’ utility and think that the beneficiaries themselves would be risk-averse. On
the other hand, if a policymaker believed that beneficiaries were stuck in a poverty trap and
needed a large enough push to break out of the trap, they might instead be risk-loving.
Another reason to care about the full range of a study’s possible results is more insidious: we might have a bias towards remembering the larger results. Those studies with larger
effect sizes are better-cited in the data, and if policymakers are optimistic they may also put more weight on those studies that find the largest effects. This could bias policy towards those interventions that are more heterogeneous rather than more efficacious.9
9 If we are indeed more likely to remember larger effect sizes, this could also help to explain the very large gap between the sample sizes of the studies in the data set and the sample sizes they would have required in order to have power = 0.8 at the beginning of the study had the authors’ priors been equal to the effect sizes actually found (Coville and Vivalt, 2015). Either researchers are conducting studies that they know are underpowered a priori, or else effect sizes are lower than they expect (or some combination of the two).
Finally, the large errors call into question our standard approach to impact evaluation.
It might still be worthwhile to conduct impact evaluations with a large prediction error, but
several factors should be considered. First, the costs of the evaluation should be weighed
against the benefits. Some evaluations are relatively low-cost; others are not. The average
World Bank impact evaluation costs $500,000 (IEG, 2012). In our toy example with θ* = 0.1, an additional impact evaluation was pivotal 11% of the time, leading to improvements of about 0.1-5.7% in those cases depending on how many impact evaluations had previously
been done. A policymaker would charitably have to be willing to spend $500,000 for an
increase in the effect size of approximately half a percentage point in this scenario. Spillover
effects to other policymakers could also be considered in this calculation.
7 Conclusion
How much impact evaluation results generalize to other settings is an important topic,
and data from meta-analyses are the ideal data with which to answer this question. With
data on 20 different types of interventions, all collected in the same way, we can begin to
speak a bit more generally about how results tend to vary across contexts and what that
implies for impact evaluation design and policy recommendations.
Smaller studies tended to have larger effect sizes, which we might expect if the smaller
studies are better-targeted, are selected to be evaluated when there is a higher a priori expectation they will have a large effect size, or if there is a preference to report larger effect
sizes, which smaller studies would obtain more often by chance. Government-implemented
programs also had smaller effect sizes than academic/NGO-implemented programs, even
after controlling for sample size. This is unfortunate given we often do smaller impact evaluations with NGOs in the hopes of finding a strong positive effect that can scale through
government implementation.
I also compared within-paper heterogeneity in treatment effects to across-paper heterogeneity in treatment effects. Within-paper heterogeneity is present in my data, as papers often
report multiple results for the same outcomes, such as for different subgroups. Fortunately,
I find that even these crude measures of within-paper heterogeneity predict across-paper
heterogeneity for the relevant intervention-outcome. This implies that researchers can get a quick estimate of how well their results would apply to other settings, simply by using their
own data. With access to micro data, authors could do much richer analysis.
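As a minimal illustration of that idea, the Python sketch below treats a paper's non-overlapping subgroup estimates as separate results and uses their coefficient of variation as a rough first gauge of how variable the effect might be elsewhere; the subgroup values are invented.

import numpy as np

subgroup_effects = np.array([0.12, 0.05, 0.21, 0.09])  # e.g. boys, girls, younger, older cohorts
within_paper_cv = subgroup_effects.std(ddof=1) / subgroup_effects.mean()
print(round(within_paper_cv, 2))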
Since we may be concerned with the residual amount of heterogeneity after taking the
best-fitting model into consideration, I discussed the case of the effect of CCTs on enrollment rates. The generalizability measures improve with the addition of a mixed model that includes explanatory variables.
I consider several ways to evaluate the magnitude of the variation in results. Whether
results are too heterogeneous ultimately depends on the purpose for which they are being
used; some policy decisions might have greater room for error than others. However, it is safe
to say, looking at both the coefficient of variation and the I², which have commonly accepted
benchmarks in other disciplines, that these impact evaluations exhibit more heterogeneity
than is typical in other fields, such as medicine, even after accounting for explanatory variables in the case of conditional cash transfers. Further, I find that under mild assumptions,
the typical variance of results is such that a particular program would be mis-predicted by
more than 50% over 80% of the time.
There are some steps that researchers can take that may improve the generalizability
of their own studies. First, just as with heterogeneous selection into treatment (Chassang,
Padró i Miquel and Snowberg, 2012), one solution would be to ensure one’s impact evaluation varied some of the contextual variables that we might think underlie the heterogeneous
treatment effects. Given that many studies are underpowered as it is, that may not be
likely; however, large organizations and governments have been supporting more impact
evaluations, providing more opportunities to explicitly integrate these analyses. Efforts to
coordinate across different studies, asking the same questions or looking at some of the same
outcome variables, would also help. The framing of heterogeneous treatment effects could
also provide positive motivation for replication projects in different contexts: different findings would not necessarily negate the earlier ones but add another level of information.
In summary, generalizability is not binary but something that we can measure. This
paper showed that past results have significant but limited ability to predict other results
on the same topic, and this did not appear to be due to bias. Knowing how much results tend
to extrapolate and when is critical if we are to know how to interpret an impact evaluation’s
results or apply its findings. Given that other fields, with less heterogeneity, also seem to have a better-developed practice of replication and meta-analysis, economics would seem to have a lot to gain by expanding in this direction.
References
AidGrade (2013). “AidGrade Process Description”, http://www.aidgrade.org/methodology/processmap-and-methodology, March 9, 2013.
AidGrade (2015). “AidGrade Impact Evaluation Data, Version 1.2”.
Alesina, Alberto and David Dollar (2000). “Who Gives Foreign Aid to Whom and Why?”,
Journal of Economic Growth, vol. 5 (1).
Allcott, Hunt (forthcoming). “Site Selection Bias in Program Evaluation”, Quarterly Journal of Economics.
Bastardi, Anthony, Eric Luis Uhlmann and Lee Ross (2011). “Wishful Thinking: Belief,
Desire, and the Motivated Evaluation of Scientific Evidence”, Psychological Science.
Becker, Betsy Jane and Meng-Jia Wu (2007). “The Synthesis of Regression Slopes in
Meta-Analysis”, Statistical Science, vol. 22 (3).
Bold, Tessa et al. (2013). “Scaling-up What Works: Experimental Evidence on External
Validity in Kenyan Education”, working paper.
Borenstein, Michael et al. (2009). Introduction to Meta-Analysis. Wiley Publishers.
Boriah, Shyam et al. (2008). “Similarity Measures for Categorical Data: A Comparative
Evaluation”, in Proceedings of the Eighth SIAM International Conference on Data Mining.
Brodeur, Abel et al. (2012). “Star Wars: The Empirics Strike Back”, working paper.
Cartwright, Nancy (2007). Hunting Causes and Using Them: Approaches in Philosophy
and Economics. Cambridge: Cambridge University Press.
Cartwright, Nancy (2010). “What Are Randomized Controlled Trials Good For?”,
Philosophical Studies, vol. 147 (1): 59-70.
Casey, Katherine, Rachel Glennerster, and Edward Miguel (2012). “Reshaping Institutions:
Evidence on Aid Impacts Using a Preanalysis Plan.” Quarterly Journal of Economics, vol.
127 (4): 1755-1812.
Chassang, Sylvain, Gerard Padró i Miquel, and Erik Snowberg (2012). “Selective Trials: A Principal-Agent Approach to Randomized Controlled Experiments.” American Economic Review, vol. 102 (4): 1279-1309.
Cohen, Jacob (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates.
Coville, Aidan and Eva Vivalt (2015). “When Should You Change Your Priors When a
New Study Comes Out?”, working paper.
Deaton, Angus (2010). “Instruments, Randomization, and Learning about Development.”
Journal of Economic Literature, vol. 48 (2): 424-55.
Dehejia, Rajeev, Cristian Pop-Eleches and Cyrus Samii (2015). “From Local to Global:
External Validity in a Fertility Natural Experiment”, working paper.
Duflo, Esther, Pascaline Dupas and Michael Kremer (2012). “School Governance, Teacher
Incentives and Pupil-Teacher Ratios: Experimental Evidence from Kenyan Primary
Schools”, NBER Working Paper.
Evans, David and Anna Popova (2014). “Cost-effectiveness Measurement in Development: Accounting for Local Costs and Noisy Impacts”, World Bank Policy Research Working Paper, No. 7027.
Ferguson, Christopher and Michael Brannick (2012). “Publication bias in psychological
science: Prevalence, methods for identifying and controlling, and implications for the use of
meta-analyses.” Psychological Methods, vol. 17 (1), Mar 2012, 120-128.
Franco, Annie, Neil Malhotra and Gabor Simonovits (2014). “Publication Bias in the Social
Sciences: Unlocking the File Drawer”, Working Paper.
Gerber, Alan and Neil Malhotra (2008a). “Do Statistical Reporting Standards Affect
What Is Published? Publication Bias in Two Leading Political Science Journals”,
Quarterly Journal of Political Science, vol 3.
Gerber, Alan and Neil Malhotra (2008b). “Publication Bias in Empirical Sociological Research: Do Arbitrary Significance Levels Distort Published Results”, Sociological Methods & Research, vol. 37 (3).
Gelman, Andrew et al. (2013). Bayesian Data Analysis, Third Edition, Chapman and
Hall/CRC.
Hedges, Larry and Therese Pigott (2004). “The Power of Statistical Tests for Moderators
in Meta-Analysis”, Psychological Methods, vol. 9 (4).
Higgins, Julian PT and Sally Green (eds.) (2011). Cochrane Handbook for Systematic Reviews of Interventions, Version 5.1.0 [updated March 2011]. The Cochrane Collaboration. Available from www.cochrane-handbook.org.
Higgins, Julian PT et al. (2003). “Measuring inconsistency in meta-analyses”, BMJ 327:
557-60.
Higgins, Julian PT and Simon Thompson (2002). “Quantifying heterogeneity in a meta-analysis”, Statistics in Medicine, vol. 21: 1539-1558.
Hsiang, Solomon, Marshall Burke and Edward Miguel (2013). “Quantifying the Influence
of Climate on Human Conflict”, Science, vol. 341.
Independent Evaluation Group (2012). “World Bank Group Impact Evaluations: Relevance
and Effectiveness”, World Bank Group.
Innovations for Poverty Action (2015). “IPA Launches the Goldilocks Project: Helping
Organizations Build Right-Fit M&E Systems”, http://www.poverty-action.org/goldilocks.
Jadad, A.R. et al. (1996). “Assessing the quality of reports of randomized clinical trials: Is blinding necessary?” Controlled Clinical Trials, 17 (1): 1-12.
Millennium Challenge Corporation (2009). “Key Elements of Evaluation at MCC”,
presentation June 9, 2009.
Miguel, Edward and Michael Kremer (2004). “Worms: identifying impacts on education
and health in the presence of treatment externalities”, Econometrica, vol. 72(1).
Ng, CK (2014). “Inference on the common coefficient of variation when populations are lognormal: A simulation-based approach”, Journal of Statistics: Advances in Theory and Applications, vol. 11 (2).
Page, Matthew, McKenzie, Joanne and Andrew Forbes (2013). “Many Scenarios Exist
for Selective Inclusion and Reporting of Results in Randomized Trials and Systematic
Reviews”, Journal of Clinical Epidemiology, vol. 66 (5).
Pritchett, Lant and Justin Sandefur (2013). “Context Matters for Size: Why External
Validity Claims and Development Practice Don’t Mix”, Center for Global Development
Working Paper 336.
Pritchett, Lant, Salimah Sanji and Jeffrey Hammer (2013). “It’s All About MeE: Using Structured Experiential Learning (“e”) to Crawl the Design Space”, Center for Global Development Working Paper 233.
Rodrik, Dani (2009). “The New Development Economics: We Shall Experiment, but How
Shall We Learn?”, in What Works in Development? Thinking Big, and Thinking Small, ed.
Jessica Cohen and William Easterly, 24-47. Washington, D.C.: Brookings Institution Press.
Saavedra, Juan and Sandra Garcia (2013). “Educational Impacts and Cost-Effectiveness
of Conditional Cash Transfer Programs in Developing Countries: A Meta-Analysis”,
CESR Working Paper.
Shadish, William, Thomas Cook and Donald Campbell (2002). Experimental and Quasi-Experimental Designs for Generalized Causal Inference. Boston: Houghton Mifflin.
Simmons, Joseph and Uri Simonsohn (2011). “False-Positive Psychology: Undisclosed
Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant”,
Psychological Science, vol. 22.
Simonsohn, Uri et al. (2014). “P-Curve: A Key to the File Drawer”, Journal of Experimental Psychology: General.
Tian, Lili (2005). “Inferences on the common coefficient of variation”, Statistics in Medicine,
vol. 24: 2213-2220.
Tibshirani, Ryan and Robert Tibshirani (2009). “A Bias Correction for the Minimum Error
Rate in Cross-Validation”, Annals of Applied Statistics, vol. 3 (2).
Tierney, Michael J. et al. (2011). “More Dollars than Sense: Refining Our Knowledge of Development Finance Using AidData”, World Development, vol. 39.
Tipton, Elizabeth (2013).
“Improving generalizations from experiments using propensity score subclassification:
Assumptions, properties, and contexts”,
Journal of Educational and Behavioral Statistics, 38: 239-266.
RePEc (2013). “RePEc h-index for journals”, http://ideas.repec.org/top/top.journals.hindex.html.
Vivalt, Eva (2015a). “The Trajectory of Specification Searching Across Disciplines and
Methods”, Working Paper.
Vivalt, Eva (2015b). “How Concerned Should We Be About Selection Bias, Hawthorne
Effects and Retrospective Evaluations?”, Working Paper.
Walsh, Michael et al. (2013). “The Statistical Significance of Randomized Controlled Trial Results is Frequently Fragile: A Case for a Fragility Index”, Journal of Clinical Epidemiology.
USAID (2011). “Evaluation: Learning from Experience”, USAID Evaluation Policy,
Washington, DC.
For Online Publication
Appendices
A Guide to Appendices

A.1 Appendices in this Paper
B) Excerpt from AidGrade’s Process Description (2013).
C) Discussion of heterogeneity measures.
D) Additional results.
E) Derivation of mixed model estimation strategy.
A.2 Further Online Appendices
Having to describe data from twenty different meta-analyses and systematic reviews, I must rely in part on online appendices.
The following are available at
http://www.evavivalt.com/research:
F) The search terms and inclusion criteria for each topic.
G) The references for each topic.
H) The coding manual.
B Data Collection

B.1 Description of AidGrade’s Methodology
The following details of AidGrade’s data collection process draw heavily from AidGrade’s
Process Description (AidGrade, 2013).
Figure 7: Process Description
Stage 1: Topic Identification
AidGrade staff members were asked to each independently make a list of at least
thirty international development programs that they considered to be the most interesting.
The independent lists were appended into one document and duplicates were tagged and
removed. Each of the remaining topics was discussed and refined to bring them all to a clear
and narrow level of focus. Pilot searches were conducted to get a sense of how many impact
evaluations there might be on each topic, and all the interventions for which the very basic
pilot searches identified at least two impact evaluations were shortlisted. A random subset of these topics was then selected, along with the most popular topic from a public vote.
Stage 2: Search
Each search engine has its own peculiarities. In order to ensure all relevant papers
and few irrelevant papers were included, a set of simple searches was conducted on
different potential search engines. First, initial searches were run on AgEcon; British
Library for Development Studies (BLDS); EBSCO; Econlit; Econpapers; Google Scholar;
IDEAS; JOLISPlus; JSTOR; Oxford Scholarship Online; Proquest; PubMed; ScienceDirect;
SciVerse; SpringerLink; Social Science Research Network (SSRN); Wiley Online Library;
and the World Bank eLibrary. The list of potential search engines was compiled broadly
from those listed in other systematic reviews. The purpose of these initial searches was to
obtain information about the scope and usability of the search engines to determine which
ones would be effective tools in identifying impact evaluations on different topics. External
reviews of different search engines were also consulted, such as a Falagas et al. (2008) study
which covered the advantages and differences between the Google Scholar, Scopus, Web of
Science and PubMed search engines.
Second, searches were conducted for impact evaluations of two test topics: deworming
and toilets. EBSCO, IDEAS, Google Scholar, JOLISPlus, JSTOR, Proquest, PubMed,
ScienceDirect, SciVerse, SpringerLink, Wiley Online Library and the World Bank eLibrary
were used for these searches. Nine search strings were tried for deworming and up to 33 strings
for toilets, with modifications as needed for each search engine. For each search the number
of results and the number of results out of the first 10-50 results which appeared to be
impact evaluations of the topic in question were recorded. This gave a better sense of which
search engines and which kinds of search strings would return both comprehensive and
relevant results. A qualitative assessment of the search results was also provided for the
Google Scholar and SciVerse searches.
Finally, the online databases of J-PAL, IPA, CEGA and 3ie were searched. Since these
databases are already narrowly focused on impact evaluations, attention was restricted to
simple keyword searches, checking whether the search engines that were integrated with
each database seemed to pull up relevant results for each topic.
Ultimately, Google Scholar and the online databases of J-PAL, IPA, CEGA and 3ie,
along with EBSCO/PubMed for health-related interventions, were selected for use in the
full searches.
After the interventions of interest were identified, search strings were developed and
tested using each search source. Each search string included methodology-specific stock
keywords that narrowed the search to impact evaluation studies, except for the search
strings for the J-PAL, IPA, CEGA and 3ie searches, as these databases already exclusively
focus on impact evaluations.
Experimentation with keyword combinations in stages 1.4 and 2.1 was helpful in the
development of the search strings. The search strings could take slightly different forms for
different search engines. Search terms were tailored to the search source, and a full list is
included in an appendix.
C# was used to write a script to scrape the results from search engines. The script
was programmed to ensure that the Boolean logic of the search string was properly applied
within the constraints of each search engine’s capabilities.
Some sources were specialized and could have useful papers that do not turn up in
simple searches. The papers listed on the J-PAL, IPA, CEGA and 3ie websites are a good
example of this. For these sites, it made more sense for the papers to be manually searched
and added to the relevant spreadsheets. After the automated and manual searches were
complete, duplicates were removed by matching on author and title names.
During the title screening stage, the consolidated list of citations yielded by the scraped
searches was checked for any existing meta-analyses or systematic reviews. Any papers included
these papers included were added to the list. With these references added, duplicates were
again flagged and removed.
Stage 3: Screening
Generic and topic-specific screening criteria were developed. The generic screening criteria are detailed below, as is an example of a set of topic-specific screening criteria.
The screening criteria were very inclusive overall. This is because AidGrade purposely
follows a different approach to most meta-analyses in the hopes that the data collected can
be re-used by researchers who want to focus on a different subset of papers. Their motivation is that vast resources are typically devoted to a meta-analysis, but if another team of
researchers thinks a different set of papers should be used, they will have to scour the literature
and recreate the data from scratch. If the two groups disagree, all the public sees are their
two sets of findings and their reasoning for selecting different papers. AidGrade instead
strives to cover the superset of all impact evaluations one might wish to include along with a
list of their characteristics (e.g. where they were conducted, whether they were randomized
by individual or by cluster, etc.) and let people set their own filters on the papers or select
individual papers and view the entire space of possible results.
Figure 8: Generic Screening Criteria

Category                 Inclusion Criteria                               Exclusion Criteria
Methodologies            Impact evaluations that have counterfactuals    Observational studies, strictly qualitative studies
Publication status       Peer-reviewed or working paper                  N/A
Time period of study     Any                                             N/A
Location/Geography       Any                                             N/A
Quality                  Any                                             N/A
Figure 9: Topic-Specific Criteria Example: Formal Banking

Category: Intervention
  Inclusion criteria: Formal banking services, specifically including:
    - Expansion of credit and/or savings
    - Provision of technological innovations
    - Introduction or expansion of financial education, or other program to increase financial literacy or awareness
  Exclusion criteria: Other formal banking services; microfinance

Category: Outcomes
  Inclusion criteria:
    - Individual and household income
    - Small and micro-business income
    - Household and business assets
    - Household consumption
    - Small and micro-business investment
    - Small, micro-business or agricultural output
    - Measures of poverty
    - Measures of well-being or stress
    - Business ownership
    - Any other outcome covered by multiple papers
  Exclusion criteria: N/A
Figure 10 illustrates the difference.

Figure 10: AidGrade’s Strategy
For this reason, minimal screening was done during the screening stage. Instead, data
was collected broadly and re-screening was allowed at the point of doing the analysis. This
is highly beneficial for the purpose of this paper, as it allows us to look at the largest
possible set of papers and all subsets.
After screening criteria were developed, two volunteers independently screened the titles
to determine which papers in the spreadsheet were likely to meet the screening criteria
developed in Stage 3.1. Any differences in coding were arbitrated by a third volunteer. All
volunteers received training before beginning, based on the AidGrade Training Manual and
a test set of entries. Volunteers’ training inputs were screened to ensure that only proficient volunteers would be allowed to continue. Of those papers that passed the title screening,
two volunteers independently determined whether the papers in the spreadsheet met the
screening criteria developed in Stage 3.1 judging by the paper abstracts. Any differences in
coding were again arbitrated by a third volunteer. The full text was then found for those
papers which passed both the title and abstract checks. Any paper that proved not to
be a relevant impact evaluation using the aforementioned criteria was discarded at this stage.
Stage 4: Coding
Two AidGrade members each independently used the data extraction form developed
in Stage 4.1 to extract data from the papers that passed the screening in Stage 3. Any
disputes were arbitrated by a third AidGrade member. These AidGrade members received
much more training than those who screened the papers, reflecting the increased difficulty
of their work, and also did a test set of entries before being allowed to proceed. The data
extraction form was organized into three sections: (1) general identifying information; (2)
paper and study characteristics; and (3) results. Each section contained qualitative and
quantitative variables that captured the characteristics and results of the study.
Stage 5: Analysis
A researcher was assigned to each meta-analysis topic who could specialize in determining which of the interventions and results were similar enough to be combined. If in doubt,
researchers could consult the original papers. In general, researchers were encouraged to
focus on all the outcome variables for which multiple papers had results.
When a study had multiple treatment arms sharing the same control, researchers would
check whether enough data was provided in the original paper to allow estimates to be
combined before the meta-analysis was run. This is a best practice to avoid double-counting
the control group; for details, see the Cochrane Handbook for Systematic Reviews of
Interventions (2011). If a paper did not provide sufficient data for this, the researcher would
make the decision as to which treatment arm to focus on. Data were then standardized
within each topic to be more comparable before analysis (for example, units were converted).
The subsequent steps of the meta-analysis process are irrelevant for the purposes of
this paper. It should be noted that the first set of ten topics followed a slightly different
procedure for stages (1) and (2). Only one list of potential topics was created in Stage
1.1, so Stage 1.2 (Consolidation of Lists) was only vacuously followed. There was also no
randomization after public voting (Stage 1.7) and no scripted scraping searches (Stage 2.3),
as all searches were manually conducted using specific strings. A different search engine was
also used: SciVerse Hub, an aggregator that includes SciVerse Scopus, MEDLINE, PubMed
Central, ArXiv.org, and many other databases of articles, books and presentations. The
search strings for both rounds of meta-analysis, manual and scripted, are detailed in another
appendix.
C Heterogeneity Measures
As discussed in the main text, we will want measures that speak to both the overall
variability as well as the amount that can be explained.
The most obvious measure to consider is the variance of studies’ results, var(Yi ). A
potential drawback to using the variance as a measure of generalizability is that we might be
concerned that studies that have higher effect sizes or are measured in terms of units with
larger scales have larger variances. This would limit us to making comparisons only between
data with the same scale. We could either: 1) restrict attention to those outcomes in the
same natural units (e.g. enrollment rates in percentage points); 2) convert results to be in
terms of a common unit, such as standard deviations10 ; or 3) scale the measure, such as by
the mean result, to create a unitless figure. Scaling the standard deviation of results within
an intervention-outcome combination by the mean result within that intervention-outcome
creates a measure known as the coefficient of variation, which represents the inverse of the
signal-to-noise ratio, and as a unitless figure can be compared across intervention-outcome
combinations with different natural units. It is not immune to criticism, however, particularly
in that it may result in large values as the mean approaches zero.
The measures discussed so far focus on variation. However, if we could explain the
variation, it would no longer worsen our ability to make predictions in a new setting, so
long as we had all the necessary data from that setting, such as covariates, with which to
extrapolate. One portion of the variation that can be immediately explained is the sampling
variance, var(Yi | θi), denoted σ². The variation in observed effect sizes is:

var(Yi) = τ² + σ²    (15)

and the proportion of the variation that is not sampling error is:

I² = τ² / (τ² + σ²)    (16)
The I² is an established metric in the meta-analysis literature that helps determine whether a fixed- or random-effects model is more appropriate; the higher the I², the less plausible it is that sampling error drives all the variation in results, and the more appropriate a random-effects model is. I² is considered “low” at 0.25, “moderate” at 0.5, and “high” at 0.75 (Higgins et al., 2003).11
10 This can be problematic if the standard deviations themselves vary but is a common approach in the meta-analysis literature in lieu of a better option.
11 The Cochrane Collaboration uses a slightly different set of norms, saying 0-0.4 “might not be important”, 0.3-0.6 “may represent moderate heterogeneity”, 0.5-0.9 “may represent substantial heterogeneity”, and 0.75-1 “considerable heterogeneity” (Higgins and Green, 2011).
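For reference, a short Python sketch of one common way to estimate I² from study estimates and standard errors, via Cochran's Q as in Higgins et al. (2003); the data are invented and the paper's own estimation routine may differ in its details.

import numpy as np

def I_squared(Y, se):
    w = 1.0 / se ** 2                        # inverse-variance weights
    mu_fe = np.sum(w * Y) / np.sum(w)        # fixed-effect pooled estimate
    Q = np.sum(w * (Y - mu_fe) ** 2)         # Cochran's Q statistic
    k = len(Y)
    return max((Q - (k - 1)) / Q, 0.0)       # I-squared, truncated at zero

Y = np.array([0.10, 0.25, -0.05, 0.40, 0.15])
se = np.array([0.05, 0.08, 0.06, 0.10, 0.07])
print(I_squared(Y, se))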
Table 10: Summary of heterogeneity measures

                      Measure of     Measure of proportion of        Measure makes use of
                      variation      variation that is systematic    explanatory variables
var(Yi)                   X
varR(Yi − Ŷi)             X                                                    X
CV(Yi)                    X
CVR(Yi − Ŷi)              X                                                    X
I²                                              X
IR²                                             X                             X
R²                                              X                             X
If we wanted to explain more of the variation, we could use a mixed model and, upon estimating it, calculate several additional statistics: the amount of residual variation in Yi after accounting for Xn, varR(Yi), the coefficient of residual variation, CVR(Yi), and the residual IR². Further, we can examine the R² of the meta-regression.
It should be noted that a linear meta-regression is only one way of modelling variation in
Yi. The I², for example, is analogous to the reliability coefficient of classical test theory or
the generalizability coefficient of generalizability theory (a branch of psychometrics), both
of which estimate the proportion of variation that is not error. In this literature, additional
heterogeneity is usually modelled using ANOVA rather than meta-regression. Modelling
variation in treatment effects also does not have to occur only retrospectively at the conclusion of studies; we can imagine that a carefully-designed study could anticipate and estimate
some of the potential sources of variation experimentally.
Table 10 summarizes the different indicators, dividing them into measures of variation
and measures of the proportion of variation that is systematic.
Each of these metrics has its advantages and disadvantages. Table 11 summarizes the
desirable properties of a measure of heterogeneity and which properties are possessed by each
of the discussed indicators. Measuring heterogeneity using the variance of Yi requires the
Yi to have comparable units. Using the coefficient of variation requires the assumption that
the mean effect size is an appropriate measure with which to scale sd(Yi ). The variance and
coefficient of variation also do not have anything to say about the amount of heterogeneity
that can be explained. Adding explanatory variables also has its limitations. In any model,
we have no way to guarantee that we are indeed capturing all the relevant factors. While
I² has the nice property that it disaggregates sampling variance as a source of variation,
estimating it depends on the weights applied to each study’s results and thus, in turn, on
Table 11: Desirable properties of a measure of heterogeneity

                  Does not depend on    Does not depend on the     Does not depend on    Does not depend on the
                  the number of         precision of individual    the estimates'        mean result in the cell
                  studies in a cell     estimates                  units
var(Yi)                   X                      X                                                 X
varR(Yi − Ŷi)             X                      X                                                 X
CV(Yi)                    X                      X                         X
CVR(Yi − Ŷi)              X                      X                         X
I²                        X                                                X                       X
IR²                       X                                                X                       X
R²                        X                      X                         X                       X

A “cell” here refers to an intervention-outcome combination. The “precision” of an estimate refers to its standard error.
the sample sizes of the studies. The R2 has its own well-known caveats, such as that it can
be artificially inflated by over-fitting.
To get a full picture of the extent to which results might generalize, then, multiple
measures should be used.
D Additional Results
Intervention
Conditional cash transfers
Conditional cash transfers
Conditional cash transfers
Conditional cash transfers
Conditional cash transfers
Conditional cash transfers
Conditional cash transfers
Conditional cash transfers
Conditional cash transfers
Conditional cash transfers
HIV/AIDS Education
HIV/AIDS Education
HIV/AIDS Education
Unconditional cash transfers
Unconditional cash transfers
Unconditional cash transfers
Insecticide-treated bed nets
Contract teachers
Deworming
Deworming
Deworming
Deworming
Deworming
Deworming
Deworming
Deworming
Deworming
Deworming
Deworming
Deworming
Financial literacy
Improved stoves
Improved stoves
Improved stoves
Improved stoves
Irrigation
Irrigation
Table 12: Descriptive Statistics: Standardized Narrowly Defined Outcomes
Outcome
# Neg sig papers # Insig papers
Attendance rate
0
6
Enrollment rate
0
6
Height
0
1
Height-for-age
0
6
Labor force participation
1
12
Probability unpaid work
1
0
Test scores
1
2
Unpaid labor
0
2
Weight-for-age
0
2
Weight-for-height
0
1
Pregnancy rate
0
2
Probability has multiple sex partners
0
1
Used contraceptives
1
6
Enrollment rate
0
3
Test scores
0
1
Weight-for-height
0
2
Malaria
0
3
Test scores
0
1
Attendance rate
0
1
Birthweight
0
2
Diarrhea incidence
0
1
Height
3
10
Height-for-age
1
9
Hemoglobin
0
13
Malformations
0
2
Mid-upper arm circumference
2
0
Test scores
0
0
Weight
3
8
Weight-for-age
1
6
Weight-for-height
2
7
Savings
0
2
Chest pain
0
0
Cough
0
0
Difficulty breathing
0
0
Excessive nasal secretion
0
1
Consumption
0
1
Total income
0
1
# Pos sig papers
9
31
1
1
5
4
2
3
0
1
0
1
3
8
1
0
6
2
1
0
1
4
4
2
0
5
2
7
5
2
3
2
2
2
1
1
1
# Papers
15
37
2
7
18
5
5
5
2
2
2
2
10
11
2
2
9
3
2
2
2
17
14
15
2
7
2
18
12
11
5
2
2
2
2
2
2
Microfinance
Microfinance
Microfinance
Microfinance
Microfinance
Micro health insurance
Micronutrient supplementation
Micronutrient supplementation
Micronutrient supplementation
Micronutrient supplementation
Micronutrient supplementation
Micronutrient supplementation
Micronutrient supplementation
Micronutrient supplementation
Micronutrient supplementation
Micronutrient supplementation
Micronutrient supplementation
Micronutrient supplementation
Micronutrient supplementation
Micronutrient supplementation
Micronutrient supplementation
Micronutrient supplementation
Micronutrient supplementation
Micronutrient supplementation
Micronutrient supplementation
Micronutrient supplementation
Micronutrient supplementation
Micronutrient supplementation
Micronutrient supplementation
Mobile phone-based reminders
Mobile phone-based reminders
Performance pay
Rural electrification
Rural electrification
Rural electrification
Safe water storage
Scholarships
Scholarships
Scholarships
Assets
Consumption
Profits
Savings
Total income
Enrollment rate
Birthweight
Body mass index
Cough prevalence
Diarrhea incidence
Diarrhea prevalence
Fever incidence
Fever prevalence
Height
Height-for-age
Hemoglobin
Malaria
Mid-upper arm circumference
Mortality rate
Perinatal deaths
Prevalence of anemia
Stillbirths
Stunted
Test scores
Triceps skinfold measurement
Wasted
Weight
Weight-for-age
Weight-for-height
Appointment attendance rate
Treatment adherence
Test scores
Enrollment rate
Study time
Total income
Diarrhea incidence
Attendance rate
Enrollment rate
Test scores
0
0
1
0
0
0
0
0
0
1
0
0
1
3
5
7
0
2
0
1
0
0
0
1
1
0
4
1
0
1
1
0
0
0
0
0
0
0
0
3
2
3
3
3
1
4
1
3
5
5
2
2
22
23
11
2
9
12
5
6
4
5
2
0
2
19
23
18
0
3
2
1
1
2
1
1
2
2
1
0
1
0
2
1
3
4
0
5
1
0
2
7
8
29
0
7
0
0
9
0
0
7
1
0
13
10
8
2
1
1
2
2
0
1
1
3
0
4
2
5
3
5
2
7
5
3
11
6
2
5
32
36
47
2
18
12
6
15
4
5
10
2
2
36
34
26
3
5
3
3
3
2
2
2
5
2
School meals
School meals
School meals
Water treatment
Water treatment
Women’s empowerment programs
Women’s empowerment programs
Average
Enrollment rate
Height-for-age
Test scores
Diarrhea incidence
Diarrhea prevalence
Savings
Total income
0
0
0
0
0
0
0
0.6
3
2
2
1
1
1
0
4.2
0
0
1
1
5
1
2
3.2
3
2
3
2
6
2
2
7.9
Intervention
Micronutrients
Conditional Cash Transfers
Conditional Cash Transfers
Deworming
Micronutrients
Micronutrients
Micronutrients
School Meals
Micronutrients
Unconditional Cash Transfers
SMS Reminders
Micronutrients
Micronutrients
Micronutrients
Micronutrients
Micronutrients
Microfinance
Conditional Cash Transfers
Conditional Cash Transfers
Deworming
Micronutrients
Bed Nets
Scholarships
Conditional Cash Transfers
HIV/AIDS Education
Deworming
Deworming
Deworming
Micronutrients
Micronutrients
Deworming
Conditional Cash Transfers
Micronutrients
Table 13: Across-Paper vs. Mean Within-Paper Heterogeneity
Outcome
Across-paper Within-paper Across-paper Within-paper
var(Yi )
var(Yi )
CV(Yi )
CV(Yi )
Cough prevalence
0.001
0.006
1.017
3.181
Enrollment rate
0.009
0.027
0.790
0.968
Unpaid labor
0.009
0.004
0.918
0.853
Hemoglobin
0.009
0.068
1.639
8.687
Weight-for-height
0.010
0.005
2.252
*
Birthweight
0.010
0.011
0.974
0.963
Weight-for-age
0.010
0.124
2.370
0.713
Height-for-age
0.011
0.000
1.086
*
Height-for-age
0.012
0.042
2.474
3.751
Enrollment rate
0.014
0.014
1.223
*
Treatment adherence
0.022
0.008
1.479
0.672
Height
0.023
0.028
4.001
3.471
Stunted
0.024
0.059
1.085
24.373
Mortality rate
0.026
0.195
2.533
1.561
Weight
0.029
0.027
2.852
0.149
Fever prevalence
0.034
0.011
5.937
0.126
Total income
0.037
0.003
1.770
1.232
Probability unpaid work
0.046
0.386
1.419
0.408
Attendance rate
0.046
0.018
0.591
0.526
Height
0.048
0.112
1.845
0.211
Perinatal deaths
0.049
0.015
2.087
0.234
Malaria
0.052
0.047
0.650
4.093
Enrollment rate
0.053
0.026
1.094
1.561
Height-for-age
0.055
0.002
22.166
1.212
Used contraceptives
0.059
0.120
2.863
6.967
Weight-for-height
0.072
0.164
3.127
*
Height-for-age
0.100
0.005
2.043
1.842
Weight-for-age
0.108
0.004
2.317
1.040
Diarrhea incidence
0.135
0.016
2.844
1.741
Diarrhea prevalence
0.137
0.029
1.375
3.385
Weight
0.168
0.121
4.087
1.900
Labor force participation
0.790
0.047
2.931
4.300
Hemoglobin
2.650
0.176
2.982
0.731
Across-paper
I2
0.905
1.000
0.927
0.661
0.798
0.930
1.000
0.996
1.000
1.000
0.998
0.987
0.222
0.037
0.742
0.696
0.999
1.000
1.000
1.000
0.402
0.999
1.000
0.036
0.351
1.000
1.000
1.000
0.993
0.949
1.000
0.269
1.000
Within-paper values are based on those papers which report results for different subsets of the data. For closer comparison of the across and
within-paper statistics, the across-paper values are based on the same data set, aggregating the within-paper results to one observation per
Within-paper
I2
1.000
0.714
0.828
0.835
0.603
0.982
0.580
0.849
0.417
0.398
0.644
0.564
0.033
0.008
0.081
0.005
1.000
0.489
0.196
0.740
0.009
0.574
0.559
0.693
0.485
0.975
0.806
0.827
0.861
0.653
0.918
0.605
1.000
intervention-outcome-paper, as discussed. Each paper needs to have reported 3 results for an intervention-outcome combination for it to be included
in the calculation, in addition to the requirement of there being 3 papers on the intervention-outcome combination. Due to the slightly different
sample, the across-paper statistics diverge slightly from those reported in Table 3. Occasionally, within-paper measures of the mean equal or
approach zero, making the coefficient of variation undefined or unreasonable; * denotes those coefficients of variation that were either undefined or
greater than 1,000,000.
Intervention
Table 14: Heterogeneity Measures for RCTs
Outcome
ρ1,2
Conditional Cash Transfers
Micronutrients
Microfinance
Financial Literacy
Contract Teachers
Performance Pay
Micronutrients
Micronutrients
Micronutrients
Micronutrients
Conditional Cash Transfers
Conditional Cash Transfers
Micronutrients
Deworming
Micronutrients
Water Treatment
Conditional Cash Transfers
Micronutrients
SMS Reminders
School Meals
Unconditional Cash Transfers
Micronutrients
Micronutrients
Micronutrients
Conditional Cash Transfers
Bed Nets
HIV/AIDS Education
Micronutrients
Micronutrients
Unpaid labor
Cough prevalence
Savings
Savings
Test scores
Test scores
Weight-for-height
Body mass index
Weight-for-age
Birthweight
Height-for-age
Test scores
Height-for-age
Hemoglobin
Mid-upper arm circumference
Diarrhea prevalence
Enrollment rate
Test scores
Treatment adherence
Test scores
Enrollment rate
Height
Mortality rate
Stunted
Attendance rate
Malaria
Used contraceptives
Weight
Perinatal deaths
0.00
0.01
0.01
0.01
0.01
0.01
0.00
0.00
0.01
0.01
0.03
0.03
0.02
0.01
0.02
0.01
0.00
0.00
0.01
0.05
0.01
-0.01
0.01
0.02
0.01
0.02
0.01
0.00
0.02
ρ5,6
ρ10,11
0.00
0.00
-0.01
0.00
0.00
0.00
0.01
0.00
0.00
0.01
0.00
0.00
0.00
0.01
0.01
0.01
0.02
0.01
0.00
0.01
0.01
0.02
0.00
-0.01
0.00
0.00
var(Yi )
CV(Yi )
I2
N
0.000
0.001
0.003
0.004
0.005
0.006
0.009
0.009
0.009
0.010
0.011
0.012
0.012
0.015
0.018
0.020
0.020
0.021
0.022
0.023
0.023
0.024
0.025
0.025
0.025
0.029
0.031
0.037
0.038
0.16
1.65
1.43
5.47
0.40
0.61
2.40
1.06
2.18
0.98
0.83
1.19
2.43
3.38
2.40
0.97
0.70
1.45
1.67
1.29
1.03
3.76
2.88
1.11
0.41
0.50
1.63
2.76
2.10
0.99
1.00
1.00
0.99
1.00
1.00
0.79
1.00
0.99
0.95
0.04
1.00
1.00
0.99
0.70
1.00
1.00
1.00
0.93
0.55
1.00
0.98
0.05
0.11
0.81
0.99
0.94
0.78
0.04
2
3
2
5
3
3
25
3
32
7
2
4
35
15
15
6
17
8
5
3
6
30
12
5
6
9
8
33
6
Deworming
Conditional Cash Transfers
Deworming
Micronutrients
Conditional Cash Transfers
Deworming
Micronutrients
Deworming
Micronutrients
Micronutrients
School Meals
Micronutrients
Deworming
Micronutrients
SMS Reminders
Deworming
Average
Height
Labor force participation
Weight-for-height
Stillbirths
Probability unpaid work
Height-for-age
Prevalence of anemia
Weight-for-age
Diarrhea incidence
Diarrhea prevalence
Enrollment rate
Fever prevalence
Weight
Hemoglobin
Appointment attendance rate
Mid-upper arm circumference
0.00
0.01
0.02
0.03
0.02
0.01
0.04
0.01
0.04
0.03
0.16
0.02
-0.05
0.00
0.01
0.00
0.02
0.01
0.00
0.03
0.00
0.00
0.01
0.01
0.02
-0.01
0.03
0.00
0.00
0.00
0.00
0.00
0.02
-0.03
0.03
0.00
0.01
0.00
0.049
0.051
0.072
0.075
0.082
0.098
0.100
0.107
0.109
0.111
0.143
0.146
0.184
0.219
0.224
0.439
2.36
2.15
3.13
3.04
1.19
1.98
0.84
2.29
3.30
1.21
1.24
3.08
4.76
1.47
2.91
1.77
1.00
1.00
1.00
0.01
1.00
1.00
0.91
1.00
1.00
0.99
0.40
0.78
1.00
1.00
0.94
1.00
17
6
11
4
2
14
14
12
11
6
2
5
18
45
3
7
0.060
1.90
0.84
11
Table 15: Heterogeneity Measures for Higher-Quality Studies
Intervention
Outcome
ρ1,2
ρ5,6
Micronutrients
SMS Reminders
Microfinance
Financial Literacy
Contract Teachers
Conditional Cash Transfers
Conditional Cash Transfers
Micronutrients
Micronutrients
Micronutrients
Micronutrients
Micronutrients
Deworming
Micronutrients
Water Treatment
Micronutrients
Micronutrients
School Meals
Micronutrients
Conditional Cash Transfers
Micronutrients
HIV/AIDS Education
Bed Nets
Micronutrients
Deworming
SMS Reminders
Micronutrients
Deworming
Deworming
Cough prevalence
Treatment adherence
Savings
Savings
Test scores
Labor force participation
Test scores
Body mass index
Weight-for-height
Weight-for-age
Birthweight
Height-for-age
Hemoglobin
Mid-upper arm circumference
Diarrhea prevalence
Test scores
Mortality rate
Test scores
Height
Attendance rate
Stunted
Used contraceptives
Malaria
Weight
Height
Appointment attendance rate
Perinatal deaths
Weight-for-height
Height-for-age
0.01
0.01
0.01
0.01
0.01
0.02
0.00
0.00
0.01
0.01
0.01
-0.01
0.00
0.01
0.01
0.00
0.01
0.04
0.01
0.03
0.02
0.01
0.03
-0.01
0.03
0.02
0.02
0.01
0.01
ρ10,11
0.00
0.00
0.00
0.00
0.00
0.01
0.01
0.00
0.00
0.01
0.00
0.00
0.00
0.00
0.00
-0.01
0.01
-0.01
0.02
-0.01
0.00
0.00
-0.01
0.00
0.00
var(Yi )
CV(Yi )
I2
N
0.001
0.002
0.003
0.004
0.005
0.006
0.006
0.009
0.009
0.009
0.009
0.013
0.015
0.019
0.020
0.021
0.022
0.023
0.025
0.025
0.025
0.031
0.038
0.039
0.049
0.053
0.060
0.072
0.098
1.65
0.47
1.43
5.47
0.40
0.24
0.58
1.06
2.30
2.05
0.73
2.32
3.38
2.34
0.97
1.45
7.28
1.29
3.73
0.34
1.11
1.63
0.56
2.75
2.73
0.55
3.66
3.13
1.98
1.00
0.32
1.00
0.99
1.00
0.02
1.00
1.00
0.79
0.99
0.99
1.00
1.00
0.73
1.00
1.00
0.11
0.54
0.97
0.28
0.11
0.94
0.99
0.72
1.00
0.79
0.21
1.00
1.00
3
3
2
5
3
2
3
3
24
30
4
32
15
14
6
8
10
3
29
3
5
8
7
31
16
2
4
11
14
Micronutrients
Deworming
Micronutrients
Micronutrients
Micronutrients
Micronutrients
Micronutrients
Deworming
Deworming
Average
Prevalence of anemia
Weight-for-age
Diarrhea incidence
Diarrhea prevalence
Fever prevalence
Stillbirths
Hemoglobin
Weight
Mid-upper arm circumference
0.03
-0.02
0.00
0.03
0.02
0.05
-0.01
-0.02
0.00
0.00
-0.02
0.04
0.03
0.01
0.00
0.00
0.01
0.00
0.00
-0.02
0.01
0.01
0.00
0.00
0.100
0.107
0.109
0.111
0.146
0.173
0.197
0.198
0.439
0.84
2.29
3.30
1.21
3.08
2.62
1.51
3.85
1.77
0.84
1.00
1.00
0.98
0.81
0.02
1.00
1.00
1.00
14
12
11
6
5
2
40
16
7
0.060
2.05
0.79
11
Table 16: Heterogeneity Measures for Effect Sizes Within Intervention-Outcomes, Including Outlier
Intervention
Outcome
ρ1,2
ρ5,6
ρ10,11
Microfinance
Rural Electrification
Micronutrients
Microfinance
Microfinance
Financial Literacy
Microfinance
Contract Teachers
Performance Pay
Micronutrients
Conditional Cash Transfers
Micronutrients
Micronutrients
Micronutrients
Micronutrients
Conditional Cash Transfers
Deworming
Micronutrients
Conditional Cash Transfers
Unconditional Cash Transfers
Water Treatment
SMS Reminders
Conditional Cash Transfers
School Meals
Micronutrients
Micronutrients
Micronutrients
Bed Nets
Conditional Cash Transfers
Assets
Enrollment rate
Cough prevalence
Total income
Savings
Savings
Profits
Test scores
Test scores
Body mass index
Unpaid labor
Weight-for-age
Weight-for-height
Birthweight
Height-for-age
Test scores
Hemoglobin
Mid-upper arm circumference
Enrollment rate
Enrollment rate
Diarrhea prevalence
Treatment adherence
Labor force participation
Test scores
Height
Mortality rate
Stunted
Malaria
Attendance rate
0.00
0.01
0.01
0.00
0.00
0.01
0.01
0.01
0.01
0.00
0.01
0.01
0.00
0.01
0.00
0.01
0.03
0.01
0.02
0.01
0.01
0.01
0.04
0.05
0.00
0.02
0.02
0.02
0.01
-0.01
0.00
0.00
0.00
0.00
0.00
-0.01
0.00
0.00
0.01
0.00
0.00
0.00
0.00
0.01
0.00
0.04
0.00
0.01
0.01
0.01
0.00
0.01
0.00
0.00
var(Yi )
CV(Yi )
I2
N
0.000
0.001
0.001
0.001
0.002
0.004
0.005
0.005
0.006
0.007
0.009
0.009
0.010
0.010
0.012
0.013
0.015
0.015
0.015
0.016
0.020
0.022
0.023
0.023
0.023
0.025
0.025
0.029
0.030
5.51
0.13
1.65
0.99
1.77
5.47
5.45
0.40
0.61
0.67
0.92
1.94
2.15
0.98
2.47
1.87
3.38
2.08
0.83
1.09
0.97
1.67
1.63
1.29
4.37
2.88
1.11
0.50
0.52
1.00
0.92
1.00
1.00
1.00
0.99
1.00
1.00
1.00
1.00
0.94
0.98
0.78
0.95
1.00
1.00
0.99
0.53
1.00
1.00
1.00
0.90
0.36
0.52
0.98
0.06
0.12
0.98
0.99
4
3
3
5
3
5
5
3
3
5
5
34
26
7
36
5
15
18
37
11
6
5
18
3
32
12
5
9
15
Micronutrients
HIV/AIDS Education
Micronutrients
Deworming
Micronutrients
Scholarships
Conditional Cash Transfers
Deworming
Micronutrients
School Meals
Micronutrients
Deworming
Deworming
Micronutrients
Micronutrients
Micronutrients
Deworming
Micronutrients
SMS Reminders
Deworming
Conditional Cash Transfers
Rural Electrification
Average
Weight
Used contraceptives
Perinatal deaths
Height
Test scores
Enrollment rate
Height-for-age
Weight-for-height
Stillbirths
Enrollment rate
Prevalence of anemia
Height-for-age
Weight-for-age
Diarrhea incidence
Diarrhea prevalence
Fever prevalence
Weight
Hemoglobin
Appointment attendance rate
Mid-upper arm circumference
Probability unpaid work
Study time
0.00
0.02
0.02
-0.03
0.00
0.01
0.05
0.01
0.03
0.12
0.01
0.02
-0.04
-0.03
0.03
0.02
-0.04
-0.02
0.01
0.00
0.02
0.08
-0.01
0.00
0.02
-0.02
0.00
0.01
0.01
0.01
-0.01
-0.01
-0.01
0.00
-0.01
0.00
-0.01
0.01
0.03
0.00
-0.01
0.00
0.00
0.03
0.03
0.00
0.01
0.00
0.00
0.034
0.036
0.038
0.049
0.052
0.053
0.055
0.072
0.075
0.081
0.095
0.098
0.107
0.109
0.111
0.146
0.184
0.215
0.224
0.439
0.609
0.997
2.70
3.12
2.10
2.36
1.69
0.69
22.17
3.13
3.04
1.14
0.79
1.98
2.29
3.30
1.21
3.08
4.76
1.44
2.91
1.77
6.42
1.10
0.68
0.46
0.04
1.00
1.00
1.00
0.04
1.00
0.02
0.01
0.78
1.00
1.00
1.00
0.96
0.81
1.00
1.00
0.99
1.00
0.96
0.03
36
10
6
17
10
5
7
11
4
3
15
14
12
11
6
5
18
46
3
7
5
3
0.083
2.52
0.80
12
Table 17: Regression of Studies’ Absolute Percent Difference from the Within-Intervention-Outcome Mean on Chronological Order

                          (1)        (2)        (3)        (4)
                        <500%     <1000%     <1500%     <2000%
Chronological order      0.053      0.699      0.926      0.971
                        (0.21)     (0.47)     (0.61)     (0.67)
Observations              480        520        527        532
R2                        0.13       0.13       0.12       0.12

Each column restricts attention to a set of results a given maximum percentage away from the mean result in an intervention-outcome combination.
E Derivation of Mixed Model Estimation Strategy
The model we are estimating is of the form:

Yi = Xi β + ui + ei    (17)

This can easily be generalized to include other explanatory variables, and I do this to estimate Model 2. However, for the sake of exposition I focus on the simplest case.

In the fully Bayesian random-effects model, we estimated the parameters θ, µ and τ using the fact that P(θ, µ, τ | Y) = P(θ | µ, τ, Y) P(µ | τ, Y) P(τ | Y), sampling from τ, calculating µ, and finally obtaining θ.

Analogously, we can estimate the hierarchical mixed model by decomposing P(β, e, τ | Y):

P(β, ei, τ | Yi) = P(ei | β, τ, Yi) P(β | τ, Yi) P(τ | Yi)    (18)

Inspecting each term on the RHS separately, we can see a similar identification strategy: sampling from τ, calculating β, calculating the sufficient statistics for ei, and finally sampling ei. In particular, the three terms can be re-written as follows.

For the first term:

P(ei | β, τ, Yi) ∝ P(β, ei, τ | Yi)    (19)

P(β, ei, τ | Yi) ∝ P(Yi | β, ei) P(ei | τ) P(β, τ)    (20)

∝ (1/√(2πσi²)) exp(−(Yi − Xi β − ei)²/(2σi²)) (1/√(2πτ²)) exp(−ei²/(2τ²)) P(β, τ)    (21)

log P(ei | β, τ, Yi) = C − (ei² − 2(Yi − Xi β) ei)/(2σi²) − ei²/(2τ²)    (22)

= C − ((σi² + τ²)/(2σi²τ²)) ei² + ((Yi − Xi β)/σi²) ei    (23)

= C − ((σi² + τ²)/(2σi²τ²)) (ei − (Yi − Xi β)τ²/(σi² + τ²))²    (24)

where C is a constant in each line that can be different throughout.

For the second term:

P(β | τ, Y) ∝ P(β, τ | Y)    (25)

P(β, τ | Y) ∝ P(β, τ) ∏_{i=1}^{n} (1/√(2π(σi² + τ²))) exp(−(Yi − Xi β)²/(2(σi² + τ²)))    (26)

log P(β | τ, Y) = log P(β | τ) − Σ_{i=1}^{n} (Yi − Xi β)²/(2(σi² + τ²)) + C    (27)

Gelman et al. (2013) suggest a noninformative prior for P(β | τ).

log P(β | τ, Y) = C − β′ (Σ_{i=1}^{n} Xi′Xi/(2(σi² + τ²))) β + (Σ_{i=1}^{n} Yi Xi/(σi² + τ²)) β    (28)

= C − (1/2)(β − Ωλ′)′ Ω⁻¹ (β − Ωλ′)    (29)

where Ω = (Σ_{i=1}^{n} Xi′Xi/(σi² + τ²))⁻¹ and λ = Σ_{i=1}^{n} Yi Xi/(σi² + τ²).

For the third term:

P(τ | Y) = P(β, τ | Y) / P(β | τ, Y)    (30)

For this last equation, note that P(β | τ, Y) is solved above, and P(β, τ | Y) is solved above except for the unknown term P(β, τ) = P(β | τ) P(τ). We already defined a uniform prior for P(β | τ) and can define another uniform prior for P(τ). Then:

log P(τ | Y) = C − (1/2) Σ_{i=1}^{n} log(σi² + τ²) − Σ_{i=1}^{n} (Yi − Xi β)²/(2(σi² + τ²))    (31)
              + (1/2) log|Ω| + (1/2)(β − Ωλ′)′ Ω⁻¹ (β − Ωλ′)    (32)

and we are ready to begin sampling τ.
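To illustrate how the pieces above fit together, the following Python sketch implements the implied sampling scheme for a toy data set with an intercept and one covariate: τ is drawn from its marginal posterior on a grid using equations (31)-(32) evaluated at β = Ωλ (so the final quadratic term vanishes), β is then drawn from its normal conditional posterior from equations (28)-(29), and each ei is drawn from the normal conditional in equation (24). The data are simulated and the code is a sketch under a uniform prior on τ, not the estimation code used in the paper.

import numpy as np

rng = np.random.default_rng(1)
n = 40
X = np.column_stack([np.ones(n), rng.uniform(0, 1, n)])   # intercept plus one covariate
sigma = rng.uniform(0.05, 0.15, n)                         # known sampling standard errors
Y = X @ np.array([0.1, -0.2]) + rng.normal(0, 0.1, n) + rng.normal(0, sigma)

def beta_posterior(tau):
    v = sigma ** 2 + tau ** 2
    Omega = np.linalg.inv((X.T / v) @ X)      # Omega = (sum_i Xi'Xi/(sigma_i^2 + tau^2))^(-1)
    lam = (X.T / v) @ Y                        # lambda = sum_i Yi Xi/(sigma_i^2 + tau^2)
    return Omega, Omega @ lam                  # posterior covariance and mean of beta given tau

def log_p_tau(tau):
    # log P(tau | Y) up to a constant, evaluated at beta = Omega * lambda
    v = sigma ** 2 + tau ** 2
    Omega, b = beta_posterior(tau)
    resid = Y - X @ b
    return -0.5 * np.sum(np.log(v)) - 0.5 * np.sum(resid ** 2 / v) + 0.5 * np.linalg.slogdet(Omega)[1]

taus = np.linspace(1e-3, 0.5, 200)
logp = np.array([log_p_tau(t) for t in taus])
p = np.exp(logp - logp.max())
p /= p.sum()

tau = rng.choice(taus, p=p)                               # draw tau from its grid posterior
Omega, beta_mean = beta_posterior(tau)
beta = rng.multivariate_normal(beta_mean, Omega)          # draw beta | tau, Y
shrink = tau ** 2 / (sigma ** 2 + tau ** 2)
e = rng.normal(shrink * (Y - X @ beta), np.sqrt(shrink * sigma ** 2))   # draw each e_i, eq. (24)
print(tau, beta)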