ARPA/Rome Proposal - Advances in Computer Science Research

An Experimental Comparison of Methods for Dealing with Missing Values in
Data Sets when Learning Bayesian Networks
Maciej Osakowicz [1] & Marek J. Druzdzel [1,2]
[1] Faculty of Computer Science, Bialystok University of Technology, Wiejska 45-A, 15-351 Bialystok, Poland, [email protected] (currently with X-code Sp. z o.o., Klaudyny 21/4, 01-684 Warszawa)
[2] Decision Systems Laboratory, School of Information Sciences and Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA 15260, USA, [email protected]
Extended Abstract
Decision-analytic modeling tools, such as Bayesian networks (Pearl, 1988), are increasingly finding
successful practical applications. Their major strengths are sound theoretical foundations, the ability to
combine existing data with expert knowledge, and the intuitive framework of directed graphs.
Bayesian networks can be based on expert opinion but can also be learned from data. A major problem with the latter is that many practical data sets contain missing values that defy simple counting
when learning conditional probability distributions (Little & Rubin, 1987). For example, of the 189 real
data sets available through the University of California at Irvine’s Machine Learning Repository
(2015), 58 (around 31%) contain missing values.
Several methods have been proposed to deal with missing values. However, there is no comprehensive study that compares them on a variety of practical data sets. One might ask several questions
in this context, for example, which of the methods is the fastest. Speed is of secondary importance, however, because learning is part of a preparatory phase that is typically performed once. Of more importance is the
quality of the resulting models, expressed by criteria such as accuracy, area under the ROC curve, or
model calibration. Of these, model accuracy, typically measured on classification tasks, seems to be
the most popular.
In this paper, we report the results of a systematic comparison of eight methods for dealing with
missing values: (1) introducing, for each variable with missing values, an additional state that represents a missing value, (2) removing all records with missing values, (3) replacing each missing value
by a state chosen at random (uniform distribution) from among the possible states that the variable
can take, (4) replacing each missing value by the most likely value, (5) Similar Response Pattern
Imputation (SRPI) (Joreskog & Sorbom, 1996), (6) replacing each missing value by the average value of the variable (Raymond, 1986), (7) replacing each missing value by the class average, and (8)
five variants of the k-nearest neighbors method (1-NN, 5-NN, 1%-NN, 10%-NN, and 25%-NN) (Fix
& Hodges, 1951).
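The four simplest methods, (1) through (4), can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in the experiments: records are lists of discrete values, a missing value is marked by `None`, "most likely value" is read as the most frequent observed state of the variable, and every variable has at least one observed value. All function names are hypothetical.

```python
import random
from collections import Counter

MISSING = None  # assumed marker for a missing value in this sketch


def add_missing_state(records, label="missing"):
    """Method (1): treat 'missing' as one more state of the variable."""
    return [[label if v is MISSING else v for v in row] for row in records]


def drop_incomplete(records):
    """Method (2): keep only records that have no missing value."""
    return [row for row in records if all(v is not MISSING for v in row)]


def fill_uniform(records, states, rng=random):
    """Method (3): draw each replacement uniformly from the variable's states.

    states[j] lists the possible states of column j.
    """
    return [[rng.choice(states[j]) if v is MISSING else v
             for j, v in enumerate(row)] for row in records]


def fill_most_likely(records):
    """Method (4): replace by the most frequent observed state of the column."""
    modes = []
    for j in range(len(records[0])):
        observed = [row[j] for row in records if row[j] is not MISSING]
        modes.append(Counter(observed).most_common(1)[0][0])
    return [[modes[j] if v is MISSING else v for j, v in enumerate(row)]
            for row in records]
```

Each function returns a new list and leaves the input untouched, so the same incomplete data set can be passed to every method before learning and the resulting models compared.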
Our comparison is based on 11 data sets from the UCI Machine Learning Repository: Adult, Thyroid
Disease, Mammographic Mass, Cylinder Bands, Congressional Voting Records, Hepatitis, Echocardiogram, Soybean, Horse Colic, Heart Disease and Annealing. Each of these sets contains missing
values and each contains a class variable, which makes them suitable for a comparison of classification accuracy.
We conducted the experiments using two network structures: (1) an unrestricted, general Bayesian
network, and (2) a naïve Bayesian network, i.e., one in which all feature variables are conditionally
independent given the class variable.
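For reference, the independence assumption behind the naïve structure can be written out as follows. The symbols are our own notation, not taken from the experiments: C denotes the class variable and F_1, ..., F_n the feature variables.

```latex
P(C, F_1, \dots, F_n) = P(C) \prod_{i=1}^{n} P(F_i \mid C),
\qquad
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(f_i \mid c)
```

The second expression is the usual classification rule for observed feature values f_1, ..., f_n. An unrestricted network, by contrast, may contain arcs between the features, so its factorization depends on the learned structure.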
Our results support the following conclusions:
(1) In most cases, differences among the methods in terms of the classification accuracy of the
resulting Bayesian networks are minimal, i.e., none of the methods uniformly stands out as
consistently better than the others on all data sets.
(2) The differences between the different methods were smaller for naïve Bayes structures than
they were for unrestricted structures.
(3) Replacing missing values by an additional state seems to perform the worst among the tested
methods.
(4) Removal of records with missing data may work well when the fraction of records with missing values is small (e.g., less than 10%). In the case of several of our data sets, this method led to
removing too many records to allow for reliable learning of the parameters from the remaining records.
(5) Replacing the missing values by the average seems to be a simple, fast method that leads to
reasonable accuracy.
(6) The k-NN method leads consistently to good results, although only when we consider its various variants, i.e., different values of k. It is the most computationally intensive of the methods tested in our experiment, which may lead to problems in the case of very large data sets.
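To illustrate why the k-NN method is the most computationally intensive, here is a minimal sketch of k-NN imputation for discrete data. It rests on assumptions that are ours and not stated above: the distance is the fraction of mismatching values over positions observed in both records, the replacement is the majority state among the k nearest records that have the variable observed, and a percentage variant such as 10%-NN is read as k equal to that percentage of the number of records. All names are hypothetical.

```python
from collections import Counter

MISSING = None  # assumed marker for a missing value


def distance(a, b):
    """Fraction of mismatches over positions observed in both records."""
    shared = [(x, y) for x, y in zip(a, b)
              if x is not MISSING and y is not MISSING]
    if not shared:
        return 1.0  # nothing to compare: treat as maximally distant
    return sum(x != y for x, y in shared) / len(shared)


def knn_impute(records, k):
    """Fill each missing entry with the majority state among the k nearest
    records in which that entry is observed."""
    filled = [list(row) for row in records]
    for i, row in enumerate(records):
        for j, v in enumerate(row):
            if v is not MISSING:
                continue
            donors = [r for m, r in enumerate(records)
                      if m != i and r[j] is not MISSING]
            if not donors:
                continue  # no record observes this variable; leave the gap
            donors.sort(key=lambda r: distance(row, r))
            votes = Counter(r[j] for r in donors[:k])
            filled[i][j] = votes.most_common(1)[0][0]
    return filled


def k_from_percent(n_records, percent):
    """Assumed reading of e.g. '10%-NN': k is that share of the record count."""
    return max(1, round(n_records * percent / 100))
```

In this sketch, every missing entry triggers a distance computation against all other records, so the cost grows roughly with the square of the number of records, in contrast with the single pass over the data that mean or mode replacement needs.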
Acknowledgments
Partial support for this work has been provided by the National Institutes of Health under grant
number U01HL101066-01. Implementation of this work is based on SMILE, a Bayesian learning and inference engine developed at the Decision Systems Laboratory and available at
http://genie.sis.pitt.edu/.
References
E. Fix & J.L. Hodges (1951). Discriminatory analysis, nonparametric discrimination: Consistency
properties. Technical report, USAF School of Aviation Medicine, Randolph Field.
University of California Irvine Machine Learning Repository (2015), https://archive.ics.uci.edu/ml/
Karl Joreskog & Dag Sorbom (1996). PRELIS 2: User’s Reference Guide. Scientific Software International, Inc, Lincolnwood, IL, USA, 3rd edition.
Roderick J A Little & Donald B Rubin (1987). Statistical analysis with missing data. John Wiley &
Sons, Inc., New York, NY, USA, 2nd edition.
Judea Pearl (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference.
Morgan Kaufmann Publishers: San Francisco, CA.
M. Raymond (1986). Missing data in evaluation research. Evaluation and the Health Professions,
9(4):395-420.