Out of Sight, Not Out of Mind: Strategies for Handling

Comments

Transcription

Out of Sight, Not Out of Mind: Strategies for Handling
Out of Sight, Not Out of Mind:
Strategies for Handling Missing Data
Eric R. Buhi, MPH, PhD; Patricia Goodson, PhD; Torsten B. Neilands, PhD
Objective: To describe and illustrate missing data mechanisms
(MCAR, MAR, NMAR) and missing
data techniques (MDTs) and offer
recommended best practices for
addressing missingness. Method:
We simulated data sets and employed ad hoc MDTs (deletion techniques, mean substitution) and sophisticated MDTs (full information
maximum likelihood, Bayesian estimation, multiple imputation) in
linear regression analyses. Results:
MCAR data yielded unbiased pa-
N
o matter how carefully health behavior researchers plan their data
collection when using survey methodologies, they will always be faced with
missing data. Missingness—as the phenomenon is commonly referred to—may
result from lost surveys, respondent refusal to answer survey questions (eg, questions may be too sensitive), skipped questions, illegible responses, procedural mistakes, computer malfunctions, or other
reasons. Additionally, survey researchers interested in studying constructs over
Eric R. Buhi, Assistant Professor, Department
of Community and Family Health, College of Public Health, University of South Florida, Tampa,
FL. Patricia Goodson, Associate Professor, Department of Health and Kinesiology, Texas A&M
University, College Station, TX. Torsten B.
Neilands, Adjunct Assistant Professor, Center
for AIDS Prevention Studies, University of California San Francisco, San Francisco, CA.
Address correspondence to Dr Buhi, Department of Community and Family Health, College of
Public Health, University of South Florida, 13201
Bruce B. Downs Blvd, MDC 56, Tampa, FL 33612.
E-mail: [email protected]
Am J Health Behav.™
™ 2008;32(1):83-92
rameter estimates across all MDTs,
but loss of power with deletion
methods. NMAR results were biased towards larger values and
greater significance. Under MAR
the sophisticated MDTs returned
estimates closer to their original
values. Conclusion: State-of-theart, readily available MDTs outperform ad hoc techniques.
Key words: missing data, MDTs,
FIML, imputation, Bayesian analysis, health behavior research
Am J Health Behav. 2008;32(1):83-92
time via longitudinal methods are forced
to deal with missing data resulting from
attrition.1 For instance, participants in a
youth survey study may withdraw from
school and lose contact with researchers
or intentionally drop out of the study due
to issues directly related to their participation or to the survey’s questions. When
eligible participants do not take part in
the study, the missing data represent
survey nonresponse. Missingness can
also occur, however, within returned surveys, in continuous variables such as
age, income, and attitudes, or in categorical variables such as level of education,
ethnicity, and behavior (eg, ever had
sexual intercourse = yes/no). Survey researchers call missing values on items
such as these item nonresponse.2 In this
instance, partial data are still available
from participants.
Switzer and Roth3 noted the “range of
potential approaches to the problem of
missing data is very broad, from ignoring
the problem altogether to sophisticated
mathematical techniques for predicting
what data would have appeared in the
missing cells” (p. 310). Several books,2,4-7
83
Handling Missing Data
as well as methods and statistics book
chapters3,8-13 discuss the complex issues
regarding managing missing data. General purpose statistical software, including SPSS (www.spss.com) and SAS
(www.sas.com), and most structural equation modeling (SEM) packages including
Amos (www.spss.com/amos), EQS
(www.mvsoft.com),
LISREL
(www.ssicentral.com),
and
Mplus
(www.statmodel.com), aid the analyst in
addressing missing data. However, in order to maximize the utility of these statistical tools, the thoughtful health behavior
researcher should become familiar both
with potential missing data mechanisms
in a data set (or types of missing data) and
missing data techniques (MDTs—or the
strategies for handling missingness) that
provide the most accurate estimates
across different scenarios. The purpose of
this paper is, therefore, to (1) review 3
types of missingness (or missing data
mechanisms) possible in data sets, (2)
review techniques for managing missing
data (MDTs), and (3) compare these MDTs,
using simulated health behavior data,
across the various mechanisms of
missingness. This comparison will illustrate the application of the different MDTs
to specific missing data mechanisms.
Missing Data Mechanisms
More important than mere familiarity
with the various techniques for handling
missing data is understanding the manner in which the needed information
became “lost” or missing. This understanding is vital “because the different
techniques for dealing with the problem
make various assumptions about why
they are missing”14 (p. 70). In other words,
identification (or an informed guess) of
the mechanisms by which incomplete
data come to be missing can help data
analysts select the optimal methods to
address data missingness.
When data sets lack data points, 3
causes are usually at play: conditional
randomness; complete randomness; and
bias, or systematic reasons. These 3
causes lead to the classification of missing data into 3 types: data that are missing at random (MAR), missing completely
at random (MCAR), and not missing at
random (NMAR).2 An analyst may truly
never know whether the data in hand are
MAR, MCAR, or NMAR. In fact, all 3 types
of missing data may be present in a single
84
data set. However, analysts should carefully consider possible reasons that data
are missing, prior to conducting any statistical procedures.
Missing at random (MAR). Switzer and
Roth3 noted, “Most studies [involving] missing data assume that the data are [missing at random, or MAR],” which is “probably the least troublesome pattern” (p.
318). When data are MAR, incomplete
data arise not from the missing values
themselves; rather, the missingness is a
function of some other observed variables
in the data set (ie, those variables for
which the study has data).15 For example,
suppose members of one sex (V1) are less
likely to disclose their weight (V2) on a
questionnaire. The probability, then, that
V2 is missing depends on the value of V1,
for which data are available. In this case,
the missing data for V2 are MAR because
it is being a man or woman (and not how
much they weigh) that explains the missing data. Similarly, in cohort studies,
baseline covariates and outcomes can
often be used to meet the MAR assumption for participants who are lost to followup. For instance, gender and participant
weight measured at time 1 may be correlated with subsequent attrition or failure
to report weight at post-baseline measurements in a longitudinal study of gender and weight change. Because the factors leading to attrition are complex, a full
treatment of the topic of missing data
methods for longitudinal studies is beyond the scope of this paper. Interested
readers are referred to Little,16 Wothke,13
Verbeke and Molenberghs, 17 and
Molenberghs and Verbeke. 18 Data that
are MAR are termed ignorable, because
when this pattern occurs, the analyst can
ignore the reason(s) data are missing and
employ a missing data technique to manage the problem.4
Missing completely at random (MCAR).
A special case of MAR, when the data are
missing completely at random (MCAR),
occurs when the probability of
missingness is unrelated to both the observed variables (ie, those for which the
study has data) and the variables with
missing values (ie, those for which the
study has no or incomplete data). An example of MCAR data occurs when a study
participant fails to return for follow-up
questioning for reasons unrelated to the
study, such as bad weather, an illness, or
a death in the family. Another example
Buhi et al
occurs when a computer malfunction precludes retrieval of some participants’ data
values but not other participants’ values.
These circumstances occur completely
at random—there is no relationship between them and the study’s topic, design,
procedures, or measures. Similar to data
that are MAR, MCAR data are ignorable;
thus, the analyst can ignore the reason(s)
the data are missing and employ a MDT.
The interested reader can refer to Heitjan
and Basu19 for more in-depth treatment of
MAR and MCAR mechanisms.
Not missing at random (NMAR). Data
that are not missing at random (NMAR)—
in other words, data made missing by
systematic influences—may present complex issues for analysts who decide to use
certain missing data techniques, as
NMAR is the most problematic pattern of
missingness. NMAR as a missing data
mechanism means that “the probability
of missingness is related to values that
are themselves missing”14 (p. 70). For
instance, people who weigh more (V2)
may be less likely to disclose their weight
on a survey. Thus, if nondisclosure of
weight is related to the participant’s
weight and unrelated to all other observed
variables in the analysis, the probability
that information on a person’s weight is
missing depends solely on the value of V2,
which is unavailable, because it is missing. In another example, a high school
student may not answer a series of survey questions due to the sensitive nature
of the items (eg, sexual or delinquent
behavior questions). The probability of a
data set containing missing data on sexual
behavior items may be directly related to
the degree of sexual behavior in the survey sample or the perceived consequences
that may result if a person were to answer
these items, or to both factors simultaneously.
Data that are NMAR are nonignorable
because standard data analysis models
cannot properly take into account incomplete data that arise from an NMAR missing data mechanism. In situations where
analysts suspect missing data are due to
one or more NMAR mechanisms, analysts would need to directly model the
missing data mechanism as part of data
analyses of interest. Such modeling relies upon strong, untestable assumptions
about how the incomplete data came to be
missing, however.4 Rather than attempting to model the missing data process, a
Am J Health Behav.™
™ 2008;32(1):83-92
more fruitful approach health behavior
researchers can take to deal with NMAR
issues is to include additional measures
during the study design phase to enhance
the likelihood of incomplete data meeting
the MAR assumption (see Schafer & Graham15).
Missing Data Techniques
The methods for handling missing data
generally fall under 3 categories: deletion, direct estimation, and imputation
techniques. Listwise and pairwise deletion techniques are encompassed in the
first category. These techniques discard
cases during an analysis if they contain
missing data. As the default MDT in many
general purpose statistical software packages, deletion methods are easy to employ and do not require a wealth of statistical expertise. Thus, they are frequently
used. By contrast, direct estimation approaches such as full information maximum likelihood (FIML) and fully Bayesian
analysis use all available information in
the data, including the observed values
from cases with data on some, but not all,
variables, to construct parameter estimates and standard errors. In the case of
Bayesian analysis, the researcher also
has the option to introduce prior information into the analysis. Several methods
for managing missing data fall under the
category of imputation: mean substitution (or single imputation, which includes
hot deck imputation) and multiple imputation (MI). According to Allison,4 “the
basic idea is to substitute some reasonable guess (imputation) for each missing
value and then proceed to do the analysis
as if there were no missing data” (p. 11).
For instance, mean substitution replaces
missing values for a variable with the
variable mean score. Hot deck imputation replaces the missing value with values from similar respondents. MI, a more
sophisticated technique, replaces missing values with a complex combination of
single and multiple imputation results.
Below, we describe these techniques in
further detail.
Listwise deletion. Listwise deletion,
frequently referred to as complete case
analysis, involves excluding from the
analysis entire cases with missing values for any variable. For example, if each
respondent fails to answer just one questionnaire item, the software package,
using listwise deletion, would discard all
85
Handling Missing Data
the data and alert the analyst there are
no valid cases.
There are 2 potential problems with
this method. First, listwise deletion results in a loss of statistical power, or an
inflation of type II error. Because this
method includes only complete cases in
the analysis, it can reduce sample size
substantially and diminish analysts’ ability to find statistically significant effects.
Second, because survey takers may
choose not to answer certain questions,
completely eliminating these people from
the analysis may introduce bias.20 As Graham et al10 noted, “The people who provide
complete data in a study are very likely
going to be different from those who do not
provide complete data” (p. 89). Thus, analysts should be extremely cautious about
making inferences because sample data
analyzed using listwise deletion may not
be generalizable to the larger population.
Pairwise deletion. Pairwise deletion,
also referred to as available case analysis, uses all available data for each variable to compute means and variances.
When conducting correlational analyses
(ie, bivariate correlations, multiple regression), however, all available pairs of
values are used. Thus, due to varying
response rates for different survey items,
the resulting values of analyses are products of different subsets of the same
sample. 3
There are 2 major weaknesses associated with pairwise deletion. First, due to
“the differing numbers of observations
used to estimate components of the covariance matrix…”21 (p. 365), pairwise deletion can—according to Pigott21—“produce
estimated covariance matrices that are
implausible, such as estimating correlations outside the range of -1.0 to 1.0” (p.
365). Even if the correlation values appear plausible and fall within the range of
-1.0 to 1.0, pairwise deletion may yield
nonpositive definite correlation or covariance matrices, which can be problematic when using LISREL or similar SEM
software. The second weakness with this
technique is similar to that of listwise
deletion: pairwise deletion diminishes
sample size and, thus, may inflate type II
error. Schafer and Graham15 noted, “If a
missing-data problem can be resolved by
discarding only a small part of the sample,
then [pairwise deletion] can be quite effective” (p. 156). However, if many of the
data are discarded, analysts will face a
86
substantially reduced sample size resulting in wasted data and a loss of statistical
power.
Mean substitution. Mean substitution—perhaps “the most widely used estimation technique”22 (p. 403)—involves replacing missing values for cases with the
variable mean score. A strength of this
method (sometimes referred to as single
imputation) is that, by replacing the missing value with an actual value, the analyst is able to increase the sample to its
original size, solving the “wasted data”
issue created by using deletion methods.
A second strength is its ease of use; no
specialized computing software or statistical expertise is required because basic
functions in general purpose statistical
packages can easily execute the mean
substitution procedure. However, for
Pigott,21 “while this strategy allows the
inclusion of all cases in a standard analysis procedure, replacing the missing values with a single value changes the distribution of that variable by decreasing
the variance that is likely present” (p.
365). Tanguma23 observed that the variances and covariances resulting from
imputing a single value will be downwardly biased, yielding attenuated correlations among variables.
Hot deck imputation. Hot deck imputation is a variant of the mean substitution technique described above. This technique involves replacing a respondent’s
missing value with a value from a similar
respondent. A similar respondent is someone who shares the same patterns of
response for a group of matching variables (determined by the analyst). The
hot deck method is a common practice in
survey research, and is used by the US
Census Bureau in its Current Population
Survey and the Substance Abuse and
Mental Health Services Administration’s
National Survey on Drug Use and Health
(formerly called the National Household
Survey on Drug Abuse).2 A weakness of
hot deck imputation is its assumption of
no differences between nonrespondents
and respondents.24 Another weakness is
that specialized software or extensive programming in standard statistical software is needed to conduct hot deck imputation, and thus, this technique is not
widely available. Interested readers can
refer to Westsat25 for an example of such
specialized software.
Multiple imputation. Single imputa-
Buhi et al
tion techniques such as mean substitution and hot deck imputation cannot account for the uncertainty introduced by
imputing data for values that are missing. Representing a more sophisticated
MDT, multiple imputation (MI) involves
replacing the missing value with 2 or
more imputed values.2 MI generates m
multiple data sets where the observed
values are identical across the data sets,
but imputed values vary in value. It is this
variability from one imputed data set to
another that enables an MI-based analysis to properly factor in the uncertainty
involved in imputing missing values. MI
is used by the Centers for Disease Control
and Prevention in its AIDS surveillance
systems and in the National Health and
Nutrition Examination Survey conducted
by the US National Center for Health
Statistics.26
MI is characterized by 2 major
strengths: First, it corrects the lowered
variance problem created by the mean
substitution technique. Second, it allows
analysts to examine the variance due to
the imputation process and even compute statistical significance correctly. 3
Graham et al10 offered a fairly nontechnical explanation of the 3-step MI process:
First, one creates m imputed data sets,
such that each data set contains a
different imputed value for every missing value. The value of m can be anything greater than 1, but typically ranges
from 5 to 20. Second, one analyzes the
m data sets, saving the parameter estimates and standard errors from each.
Third, one combines the parameter
estimates and standard errors to arrive
at a single set of parameter estimates
and corresponding standard errors.
(p. 95)
A key advantage of MI is that once the
imputed data sets are created, the researcher can use them in almost any
type of analysis ranging from simple descriptive statistics through regression
methods and complex multivariate statistical analyses.
Although Little and Rubin2 point out
that “the only disadvantage of MI over
single imputation is that it takes more
work to create the imputations and analyze the results” (p. 86), MI is readily
available to SAS 9 users through PROC
MIANALYZE and to SPSS users through
Am J Health Behav.™
™ 2008;32(1):83-92
the optional Amos structural equation
modeling module. Researchers using
these packages can thereby capitalize on
using MI as a more state-of-the-art MDT.
Some stand-alone MI packages, such as
the NORM program, are available free of
charge on the World Wide Web.27 A documentation of the NORM program and others is provided by Schafer7 in his text on
incomplete multivariate data methods. A
readable how-to guide may be found in
Schafer and Olsen.28
It is important to note that multiple
imputation—conducted through SAS
PROC MI and PROC MIANALYZE—is based
on the assumption that the data arise
from a multivariate normal distribution.
Chen et al29 and Canchola et al30 warned
that ordinal variables with highly skewed
distributions and categorical variables,
such as sexual orientation, do not adhere
to this distributional assumption. Findings arising from analyses of multiply
imputed data sets when the joint multivariate normality assumption is violated
are robust to this assumption violation,
however, as long the statistical models
used to subsequently analyze the imputed data properly account for the data’s
nonnormality. Schafer7 cites simulation
studies and provides his own simulation
evidence illustrating the robustness to
nonnormality of imputation generating
models that assume joint multivariate
normality when the number of missing
data is moderate (eg, < 50%) and the
amount of nonnormality in variables not
severe. Moreover, recent versions of
NORM and PROC MI include variable transformation and categorical variable rounding utilities that may further improve the
performance of multiple imputation conducted under the assumption of joint
multivariate normality. Nonetheless, researchers must take caution and examine distributional assumptions before
choosing MI methods that assume joint
multivariate normality of the data. Other
MDTs, which we discuss below, including
direct maximum likelihood estimation of
categorical variable analysis models and
special MI methods for categorical and
mixed categorical and continuous variable data sets, may prove to be more
fruitful when variables exhibit extreme
departures from normality. For a comprehensive discussion of MI, see Rubin6 and
Schafer. 7
Maximum likelihood methods. Maxi-
87
Handling Missing Data
mum likelihood (ML) is a general statistical estimation approach that is used often in logistic and linear regression models,4 as well as in factor analysis and
latent class analysis.15 ML parameter estimates for cases with complete and incomplete data can be obtained by using
direct ML estimation (ie, full information
ML [FIML]; see Arbuckle9 and Wothke13).
FIML is a direct method in the sense that
model parameters and standard errors
are estimated directly from available data.
The FIML algorithm directly estimates
model parameters; it does not impute
missing values.31 Cases with incomplete
data are included in computations, however, and all available data are employed
by the ML algorithm to obtain optimal
parameter estimates. The major SEM
packages including Amos 7, LISREL 8,
and Mplus 4 have the capability to run
FIML for linear models with continuous
outcomes;32 Mplus 4 features a similar
suite of estimators for models involving
ordered categorical, nominal, and Poisson distributed outcomes.33 Although it is
quite complex in terms of computation,
FIML is extremely easy to use and transparent to the researcher. For instance, in
Amos, researchers can invoke FIML estimation with a single mouse click.
Bayesian methods. Over the past 10
years, researchers in the social sciences
and medicine have been increasingly
using Bayesian statistical methods.34,35
Classical statistical methods rely on comparisons of statistics generated from
sample data to hypothesized population
parameters that are each assumed to be
an unknown but fixed value. In contrast
to classical statistical theory, the Bayesian framework relies on the notion of
subjective probability.36 In short, Bayesian analysts incorporate preexisting evidence and beliefs about a parameter into
what is called a prior distribution of probabilities. In other words, in the Bayesian
analysis framework, a population’s parameters will vary depending upon its
subjective probability. Once data are collected, a mathematical routine derived
from Bayes’ theorem is used to combine
the prior distribution with the new data to
generate an updated posterior distribution, on which conclusions are then
drawn.37 Conveniently, if the analyst assumes a uniform (also referred to as a
diffuse or noninformative) prior distribution in which each value of the distribu-
88
tion is postulated to have an equally likely
chance of occurring, the posterior distribution is asymptotically equivalent to the
likelihood of the observed data. In Bayesian analysis, incomplete data values are
viewed as additional parameters to be
estimated, subject to the constraints of
the analyst’s model and the selected prior
distributions of model parameters. According to Burton et al,37 “Bayesian-based
inferences are more intuitive and can
lead to more appropriate decisions” (p.
320) than traditional null hypothesis significance testing. One reason for the increased use of Bayesian methods has
been major improvements in computing
power. Bayesian estimation for continuous, normally distributed and ordered categorical outcome variables is available in
the Amos structural equation modeling
program.
Which Missing Data Technique is
Optimal? An Illustration
No one missing data technique is optimal for every missing data situation. As
stated earlier, understanding the conditions in which data are MAR, MCAR, or
NMAR is pivotal “because the different
techniques for dealing with the problem
make various assumptions about why
they are missing”14 (p. 70). For instance,
listwise deletion, pairwise deletion, and
mean substitution techniques assume
that all incomplete data arise from an
MCAR process. If the missing data do, in
fact, fall under the MCAR category, then
using these ad hoc methods may result in
nonbiased parameter estimates.9,13,38 If any
data are missing due to a MAR or NMAR
mechanism, however, results from subsequent analyses employing these ad hoc
methods will be biased.
FIML, Bayesian estimation, and MI
assume MCAR or MAR data. If data
missingness is due to one or both of these
mechanisms, results from analyses employing FIML, Bayesian methods, or MI
will generally be unbiased. If NMAR
missingness is present, no standard
analysis method—which we described
previously—can fully remove bias.
Muthén et al 38 argued that FIML will
result in less bias under NMAR than will
listwise deletion, pairwise deletion, and
related ad hoc methods, however. For
most conditions, mean substitution is
not recommended as an accurate missing data technique.3,10
Buhi et al
A very strict assumption that typically
does not hold fully in most studies occurs
when researchers assume missing data
arise from a MCAR situation. MAR scenarios are much more likely in social and
behavioral science research, especially
if researchers measure constructs such
as social desirability and need for cognition or reading comprehension in their
surveys. These types of constructs, for
which it is often possible to obtain complete data, can be useful in predicting who
does or does not complete survey instruments. According to Schafer and Olsen,28
“in the vast majority of studies, principled
methods [such as MI and FIML] that assume MAR will tend to perform better
than ad hoc procedures such as listwise
deletion or imputation of means” (p. 553).
To illustrate these missing data
mechanisms and how they may impact
results when employing various MDTs,
we generated simulated data from a population with known parameter values.
From the SAS Technical Support website,
we first downloaded mvn.sas, a program
that simulates bivariate or multivariate
normal data. We supplied the program
with a sample correlation matrix and
mean vector consisting of zeros with a
sample size of n = 300 (a typical sample
size found in health behavior survey research), in order to create 3 continuous
variables with a joint multivariate normal distribution. The generated complete
data set consisted of 3 columns (variables): COL1, COL2, and COL3. In this
illustration, COL1 represents alcohol use
behavior, COL2 represents depression,
and COL3 represents the perceived norm
toward alcohol use. We next created 3
different data sets containing missing
values for COL1. To illustrate a MCAR
scenario, we simply deleted the last 25%
of COL1 values—a completely random process because the computer-generated
data were themselves generated via a
completely random process. To represent
a MAR scenario, we dropped 25% of COL1
values, defined by each case’s COL2 and
COL3 scores, with higher values of COL2
and lower values of COL3 being more
likely to have missing values. Finally, to
illustrate a NMAR scenario, we dropped
25% of COL1 values, defined only by their
original COL1 values, with lower values of
COL1 being more likely to be missing.
The first analysis involved the complete data set, where we regressed COL1
Am J Health Behav.™
™ 2008;32(1):83-92
(alcohol use behavior) onto COL2 (depression) and COL3 (perceived norm toward
alcohol use). This initial investigation
served as the benchmark against which
we could compare our analyses of data
sets containing missing values. The first
row in Table 1 (labeled “Regression”) contains these complete data analysis results. Next, we subjected the MAR, MCAR,
and NMAR data sets to multiple imputation with 10 imputations created, followed by SAS PROC REG analyses of each
of the imputed data sets and the use of
PROC MIANALYZE to combine the results
from the 10 PROC REG analyses. We then
exported the MAR, MCAR, and NMAR data
sets into SPSS 14 and employed listwise
deletion, pairwise deletion, and mean
substitution techniques. Finally, we exported the 3 data sets into Amos 7, which
offers FIML estimation of incomplete
multivariate normal data as well as Bayesian estimation. Here, we set up the
regression analysis using Amos Graphics and fit the regression model to the
MAR, MCAR, and NMAR data. For each
analysis we report the regression estimates and respective 95% confidence
intervals for the COL2 and COL3 regression coefficients; where available, we also
report traditional p-values associated with
the null hypothesis that the relevant parameter is zero in the population from
which we drew our sample.
Results from the analyses described
above can be found in Table 1. Moving
from left to right in the table, readers can
see that the MCAR situation yielded unbiased estimates across the 6 MDTs, but
somewhat of a loss of statistical power for
the COL3 predictor with the deletion and
mean substitution techniques. The MAR
situation resulted in slightly downwardly
biased parameter estimates using MI,
FIML, and Bayesian estimation. Interestingly, when we employed listwise deletion (in the MAR situation), the COL3
predictor (perceived norm toward alcohol
use) actually resulted in a reversed sign,
from negative to positive, losing significant explanatory power in the theorized
direction. Pairwise deletion, under the
MAR situation, resulted in slightly biased
(larger) values and inflated standard errors for the COL3 predictor. In the NMAR
situation, the results were substantially
biased towards bigger values, more significance, and inflated standard errors for
COL2, and a general loss of statistical
89
Handling Missing Data
Table 1
Illustration of Multiple Linear Regression Analysis
Using Missing Data Techniques
Unstandardized parameter estimates (Standard errors)
[95% confidence interval (CI) estimates]a
Complete Data
(n = 300)
MCAR
(25% missingness
on COL1,d
arising from a
random process)
MAR
(25% missingness
on COL1, defined
by COL2 &
COL3 scores)
NMAR
(25% missingness
on COL1, defined
only their original
COL1 values)
COL2d
COL3d
COL2
COL3
COL2
COL3
COL2
COL3
.233 (.059)
[95% CI =
.016, .349]***
-.132 (.052)
[95% CI =
-.234, -.030]**
-
-
-
-
-
-
Multiple
imputation
-
-
FIML
-
-
Bayesian estimationb -
-
Listwise deletionc
-
-
c
Pairwise deletion
-
-
Mean substitution
-
-
.214 (.061)
[.095, .333]***
.215 (.064)
[.090, .340]***
.213 (.064)
[.086, .336]
.213 (.065)
[.086, .340]***
.219 (.065)
[.091, .346]***
.216 (.065)
[.088, .343]***
-.130 (.053)
[-.233, -.026]**
-.127 (.053)
[-.231, -.023]**
-.127 (.053)
[-.228, -.023]
-.108 (.058)
[-.222, .006]
-.129 (.059)
[-.245, -.013]*
-.121 (.053)
[-.224, -.017]*
.193 (.069)
[.057, .329]**
.202 (.066)
[.073, .331]**
.200 (.066)
[.069, .327]
.195 (.065)
[.067, .322]**
.255 (.065)
[.127, .384]***
.246 (.065)
[.117, .374]***
-.127 (.053)
[-.231, -.022]**
-.126 (.053)
[-.230, -.022]**
-.125 (.052)
[-.228, -.021]
.077 (.062)
[-.045, .200]
-.143 (.059)
[-.259, -.027]**
-.128 (.052)
[-.231, -.025]**
.292 (.084)
[.127, .457]***
.286 (.088)
[.114, .459]***
.284 (.088)
[.106, .453]
.268 (.083)
[.103, .432]**
.268 (.086)
[.098, .438]**
.258 (.086)
[.088, .428]**
-.136 (.053)
[-.241, -.032]**
-.136 (.054)
[-.242, -.029]**
-.135 (.054)
[-.241, -.029]
-.150 (.057)
[-.261, -.038]**
-.133 (.060)
[-.250, -.016]*
-.123 (.053)
[-.228, -.019]*
Regression
Note.
a
CIs provide information for point null hypothesis testing—if the interval includes zero, the
coefficient enclosed by the interval is not significant at P < .05)—but they also document the
precision of the parameter estimate—narrow intervals signify greater precision, whereas wider
intervals signify less precision.
b
Bayesian estimation in Amos does not produce p-values, but does produce 95% CIs that may be
interpreted in the same way as FIML-based 95% CIs when the data analyst uses a uniform or
non-informative prior distribution for model parameters. For instance, in our example the
Bayesian 95% CIs are virtually identical to those produced by the Amos FIML algorithm, so
essentially the same substantive conclusions may be drawn using the Bayesian and FIML
methods.
c
When employing listwise and pairwise deletion methods, the sample size was reduced to n = 240.
d
COL1 = alcohol use behavior, COL2 = depression, COL3 = perceived norm toward alcohol use
*** P < .001
* * P < .01
*
P < .05
power for the COL3 predictor with pairwise
deletion and mean substitution techniques.
Consistent with the statistical simulation literature,39 our analysis illustrated
an instance in which the listwise deletion technique did not work very well
under the MAR situation, yet FIML, MI,
and Bayesian estimation worked quite
well. Given the ready availability of the
state-of-the-art techniques in easy-touse software programs, there seems little
reason to continue using ad hoc methods
to handle missing data, unless the num-
90
ber of missing data is so small that the
resulting bias is inconsequential (eg,
5%; see Roth40). However, results from
this illustration also demonstrate the
risks of analyzing NMAR data using methods that assume incomplete data arise
from exclusively MAR or MCAR
missingness mechanisms. Generally,
across all 3 patterns of missingness, the
MI, FIML, and Bayesian estimation procedures restored parameter estimates,
confidence intervals, and statistical significance closer to the original (complete
data) values.
Buhi et al
CONCLUSION
As missing data pervade health behavior research, scholars are guaranteed to
face the problem at some time in their
data collection efforts. Unfortunately, the
predicament is one for which there are no
ideal solutions, and the researcher must
choose—among several complex techniques—the one that optimizes accuracy
of estimated parameters, while avoiding
error inflation. The purpose of this paper
was to make available to health behavior
researchers—especially to those using
survey methodologies—a primer on techniques for managing missingness. Thus,
this paper described 3 mechanisms of
missing data (MAR, MCAR, and NMAR),
and presented ad hoc methods—listwise
deletion, pairwise deletion, and single
substitution—as well as more state-ofthe-art techniques—MI, FIML, and Bayesian estimation—for managing missing
data under MCAR and MAR conditions.
Further, we illustrated, using simulated
health behavior data sets, how these missing data techniques compare across the
various mechanisms of missingness in a
simplified, but prototypical health behavior research data analysis. Our illustration reflected the findings from many
data simulation studies published in the
statistical literature: sophisticated MDTs
(FIML, Bayesian estimation, and MI) generally outperform the older ad hoc approaches (listwise deletion, pairwise deletion, and mean substitution). Based on
the results from those simulation studies, we suggest that health behavior researchers employ sophisticated MDTs
whenever possible in situations where
the amount of missing data is not trivially
small (ie, < 5% of the sample). FIML and
Bayesian methods are the easiest to use
for model types and outcomes supported
by software programs that feature these
estimation methods; for all other analysis problems, MI is a useful approach.
Finally, at the risk of stating the obvious, it is important to remember that the
best way of handling missing data in
health behavior research is by means of
prevention. Taking necessary precautions
to ensure optimal response rates and
item-response with little or no missing
information is not merely ideal, but a goal
for which every researcher should strive
with his or her research design and data
collection methods. An important component of this prevention effort may also
Am J Health Behav.™
™ 2008;32(1):83-92
include adequate preparation for data that,
despite researchers’ best preventive efforts, turn out to be incomplete: measuring variables that, when included in the
data help them meet the MAR assumption in the analysis, can be a helpful
strategy. For instance, we suggest researchers collect as much data as possible and collect auxiliary information.41
In the example of measuring weight data,
researchers should collect information
related to correlates of weight and use
proxy measures of weight in order to minimize MNAR scenarios. Researchers may
also consider social desirability issues
regarding variables of interest and
whether respondents can adequately understand questions being posed.
Regardless of the efforts employed to
prevent or prepare for missingness, as
stated previously, health behavior researchers will likely face the problem and
the need to choose among alternative
techniques. Becoming familiar with available options and understanding their applicability are important first steps. Our
wish is to facilitate accomplishing these
goals.
„
REFERENCES
1.Barry AE. How attrition impacts the internal
and external validity of longitudinal research.
J Sch Health. 2005;75(7):267-270.
2.Little RJA, Rubin DB. Statistical Analysis
with Missing Data. Hoboken, NJ: John Wiley
& Sons, Inc; 2002.
3.Switzer FS, Roth PL. Coping with missing
data. In: Rogelberg SG, ed. Handbook of Research Methods in Industrial and Organizational Psychology. Malden, MA: Blackwell;
2002:310-323.
4.Allison PD. Missing Data. Thousand Oaks,
CA: Sage; 2002.
5.Groves R, Dillman D, Eltinge J, Little RJA.
Survey Nonresponse. New York: Wiley; 2002.
6.Rubin DB. Multiple Imputation for
Nonresponse in Surveys. New York: Wiley;
1987.
7.Schafer JL. Analysis of Incomplete Multivariate Data. New York: Chapman and Hall; 1997.
8.Anderson AB, Basilevsky A, Hum DPJ. Missing data: A review of the literature. In: Rossi
PH, Wright JD, Anderson AB, eds. Handbook
of Survey Research. San Diego, CA: Academic
Press; 1983:415-494.
9.Arbuckle JL. Full information estimation in
the presence of incomplete data. In:
Marcoulides GA, Schumacker RE, eds. Advanced Structural Equation Modeling: Issues
and Techniques. Mahwah, NJ: Lawrence
Erlbaum Associates; 1996:243-277.
10.Graham JW, Cumsille PE, Elek-Fisk E. Meth-
91
Handling Missing Data
ods for handling missing data. In: Schinka
JA, Velicer WF, eds. Handbook of Psychology.
Vol 2: Research Methods in Psychology.
Hoboken, NJ: Wiley & Sons, Inc.; 2003:87114.
11.Little RJA, Rubin DB. The analysis of social
science data with missing values. In: Fox J,
Long JS, eds. Modern Methods of Data Analysis. Newbury Park, CA: Sage; 1990.
12.Tabachnick BG, Fidell LS. Using Multivariate
Statistics. 5th ed. Boston: Allyn & Bacon;
2007.
13.Wothke W. Longitudinal and multigroup modeling with missing data. In: Little TD, Schnabel
KU, eds. Modeling Longitudinal and Multilevel Data: Practical Issues, Applied Approaches, and Specific Examples. Mahwah,
NJ: Lawrence Erlbaum Associates; 2000:219240,269-281.
14.Streiner DL. The case of the missing data:
Methods of dealing with dropouts and other
research vagaries. Can J Psychiatry.
2002;47(1):68-75.
15.Schafer JL, Graham JW. Missing data: Our
view of the state of the art. Psychol Methods.
2002;7(2):147-177.
16.Little RJA. Modeling the drop-out mechanism
in repeated-measures studies. J Am Stat Assoc.
1995;90(431):1112-1121.
17.Verbeke G, Molenberghs G. Linear Mixed
Models for Longitudinal Data. New York:
Springer-Verlag; 2000.
18.Molenberghs G, Verbeke G. Models for Discrete Longitudinal Data. New York: SpringerVerlag; 2005.
19.Heitjan DF, Basu S. Distinguishing ‘missing
at random’ and ‘missing completely at random’. Am Stat. 1996;50:207-213.
20.O’Rourke TW. Methodological techniques for
dealing with missing data. American Journal of
Health Studies. 2003;18(2/3):165-168.
21.Pigott TD. A review of the methods for missing data. Educational Research and Evaluation.
2001;7(4):353-383.
22.Raymond M. Missing data in evaluation research. Evaluation and the Health Professions.
1986;9(4):395-420.
23.Tanguma J. A review of the literature on
missing data. Paper presented at the annual
meeting of the Mid-South Educational Research Association. Bowling Green, KY; 2000,
November. (ERIC Document Reproduction Service No. ED 448 174).
24.Patrician PA. Multiple imputation for missing
data. Res Nurs Health. 2002;25:76-84.
25.Westsat. WesVar: Software for analysis of
data from complex samples (On-line). Available at: http://www.westat.com/wesvar/. Accessed September 21, 2006.
26.Barnard J, Meng X. Applications of multiple
92
imputation in medical studies: From AIDS to
NHANES. Stat Methods Med Res. 1999;8:1736.
27.Schafer JL. Software for multiple imputation
(On-line).
Available
at:
http://
www.stat.psu.edu/~jls/misoftwa.html#top.
Accessed September 21, 2006.
28.Schafer JL, Olsen MK. Multiple imputation
for multivariate missing-data problems: a data
analyst’s perspective. Multivariate Behav Res.
1998;33:545-571.
29.Chen L, Toma-Drane M, Valois RF, Drane JW.
Multiple imputation for missing ordinal data.
J Mod Appl Stat Methods. 2005;4(1):288-299.
30.Canchola JA, Neilands TB, Catania JA. Imputation strategies for sexual orientation using
SAS PROC MI. Paper presented at: The Tenth
Annual Western Users of SAS Software Conference, 2002; San Diego, CA (On-line). http:/
/www.hsrg.net/download.htm. Accessed
January 17, 2007.
31.Enders CK. A primer on maximum likelihood
algorithms available for use with missing
data.
Structural
Equation
Modeling.
2001;8(1):128-141.
32.Buhi ER, Goodson P, Neilands TB. Structural
equation modeling: a primer for health behavior researchers. Am J Health Behav.
2007;31(1):74-85.
33.Muthén LK, Muthén BO. Mplus User’s Guide.
4th ed. Los Angeles, CA: Muthén & Muthén;
2006.
34.Gill J. Introduction to the special issue.
Political Analysis. 2004;12:323-337.
35.Jackman S. Estimation and inference are
missing data problems: Unifying social science statistics via Bayesian simulation. Political Analysis. 2004;12:307-332.
36.Bolstad WM. Introduction to Bayesian Statistics. Hoboken, NJ: John Wiley and Sons;
2004.
37.Burton PR, Gurrin LC, Campbell MJ. Clinical significance not statistical significance:
a simple Bayesian alternative to p values. J
Epidemiol Community Health. 1998;52:318323.
38.Muthén B, Kaplan D, Hollis M. On structural
equation modeling with data that are not
missing completely at random. Psychometrika.
1987;52:431-462.
39.Enders CK. The performance of the full
information maximum likelihood estimator in
multiple regression models with missing data.
Educ Psychol Meas. 2001;61(5):713-740.
40.Roth PL. Missing data: a conceptual review
for applied psychologists. Personnel Psychology. 1994;47:537-560.
41.De Leeuw ED. Reducing missing data in
surveys: an overview of methods. Qual Quant.
2001;35:147-160.