CRAN Task View: Missing Data
|Maintainer:||Julie Josse, Imke Mayer, Nicholas Tierney, and Nathalie Vialaneix (r-miss-tastic team)|
|Contact:||r-miss-tastic at clementine.wf|
|Contributions:||Suggestions and improvements for this task view are very welcome and can be made through issues or pull requests on GitHub or via e-mail to the maintainer address. For further details see the Contributing guide.|
|Citation:||Julie Josse, Imke Mayer, Nicholas Tierney, and Nathalie Vialaneix (r-miss-tastic team) (2022). CRAN Task View: Missing Data. Version 2022-08-24. URL https://CRAN.R-project.org/view=MissingData.|
|Installation:||The packages from this task view can be installed automatically using the ctv package. For example, |
ctv::install.views("MissingData", coreOnly = TRUE) installs all the core packages or
ctv::update.views("MissingData") installs all packages that are not yet installed and up-to-date. See the CRAN Task View Initiative for more details.
Missing data are very frequently found in datasets. Base R provides a few options to handle them using computations that involve only observed data (na.rm = TRUE in functions such as var, or use = complete.obs|na.or.complete|pairwise.complete.obs in functions such as cor). The base package stats also contains the generic function na.action, which extracts information on the NA action used to create an object. In addition, the package ie2misc contains a dyadic operator + that behaves differently from the original + operator regarding missing data.
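As a minimal sketch of the base R options mentioned above, the following uses only the base and stats packages. The toy vectors x and y are made up for illustration.

```r
# Toy data with missing values (illustrative only).
x <- c(1, 2, NA, 4, 5)
y <- c(2, NA, 6, 8, 10)

# Without na.rm = TRUE, a single NA propagates to the result.
v_all <- var(x)                  # NA
v_obs <- var(x, na.rm = TRUE)    # computed on the 4 observed values

# cor() chooses which observations to use through the `use` argument.
# "complete.obs" keeps only rows where both x and y are observed (rows 1, 4, 5).
r_xy <- cor(x, y, use = "complete.obs")

# Modelling functions record the rows they dropped;
# na.action() retrieves that information from the fitted object.
fit <- lm(y ~ x, data = data.frame(x, y))
dropped <- na.action(fit)        # rows 2 and 3, with class "omit"
```

Here lm relies on the default setting of options("na.action"), which is na.omit in a standard R session.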
These basic options are complemented by many packages on CRAN. In this task view, we focus on the most important ones, which were published more than one year ago and are regularly updated. The task view is structured by main topic, as indicated by the section headings below.
In addition to the present task view, this reference website on missing data might also be helpful. Complementary information might also be found in the TimeSeries, SpatioTemporal, Survival, and OfficialStatistics task views. Note that most packages dedicated to temporal and spatio-temporal interpolation, or to censored data, are not covered by the Missing Data task view.
If you think we have missed some important packages in this list, please e-mail the maintainers or submit an issue or pull request in the GitHub repository linked above.
Exploration of missing data
- Manipulation of missing data is implemented in the packages sjmisc, sjlabelled, retroharmonize, mde (also providing basic functions to explore missingness patterns), and tidyr (which abides by tidyverse principles). In addition, memisc provides definable missing values, along with infrastructure for the management of survey data and variable labels.
- Missing data patterns can be identified and explored using the packages mi, wrangle, DescTools, dlookr and naniar.
- Graphics that describe distributions and patterns of missing data are implemented in VIM (which has a Graphical User Interface, VIMGUI, currently archived on CRAN) and naniar (which abides by tidyverse principles).
- Tests of the MCAR assumption (versus the MAR assumption): Little’s test for the MCAR assumption is implemented in misty. Other approaches are also available elsewhere: RBtest proposes a regression based approach to test for missing data mechanisms and samon performs sensitivity analysis in clinical trials to check the relevance of the MAR assumption. In addition, isni tests sensitivity to the ignorability assumption by computing the index of local sensitivity to nonignorability.
- Evaluation: missCompare and missMethods offer an entire framework to compare different imputation strategies (with diagnostics and visualizations). The package Iscores can also be useful to evaluate imputation quality using a KL-based scoring rule.
Simulations to evaluate imputation quality can be performed using the function ampute of mice, the package simFrame, which proposes a very general framework for simulations, or the package simglm, which simulates data and missing values in simple and generalized linear regression models. Similarly, imputeTestbench provides a benchmark to evaluate univariate time series imputation.
In addition, mi and VIM also provide diagnostic plots that can help evaluate imputation quality.
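The kind of pattern exploration described in this section can be sketched with base R alone. The example below is only an illustration on a made-up data frame dat. The packages listed above offer much richer summaries and graphics.

```r
# Made-up data frame with missing values placed by hand.
set.seed(1)
dat <- data.frame(a = rnorm(8), b = rnorm(8), c = rnorm(8))
dat$a[c(2, 5)] <- NA
dat$b[c(5, 7)] <- NA

# Proportion of missing values per variable.
prop_missing <- colMeans(is.na(dat))

# Missingness pattern of each row, coded as a string:
# "1" = observed, "0" = missing, one character per variable.
pattern <- apply(!is.na(dat), 1, function(r) paste(as.integer(r), collapse = ""))

# How often each pattern occurs.
pattern_counts <- table(pattern)

# Number of rows without any missing value.
n_complete <- sum(complete.cases(dat))
```

With three variables, the pattern "111" denotes a complete row and "001" a row where only the third variable is observed.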
Likelihood based approaches
- Methods based on the Expectation Maximization (EM) algorithm are implemented in norm (using the function em.norm for multivariate Gaussian data), norm2 (using the function emNorm), in cat (function em.cat for multivariate categorical data), and in mix (function em.mix for multivariate mixed categorical and continuous data). These packages also implement Bayesian approaches (with Imputation and Posterior steps) for the same models (functions da.norm, da.cat, and da.mix in norm, cat, and mix) and can be used to obtain imputed complete datasets or multiple imputations (functions imp.norm, imp.cat, and imp.mix), once the model parameters have been estimated. monomvn proposes similar methods for multivariate normal and Student distributions when the missingness pattern is monotonic.
imputeMulti and MMDai extend these methods by using an EM approach to fit different mixture models to multivariate categorical data with missing values. RMixtCompIO is a complete library of mixture models that handles missing data and is based on the C++ library MixtComp. It can be used in combination with RMixtCompUtilities, which provides various graphical, getter, and utility functions.
Hierarchical Gaussian and probit models with missing covariate values are implemented in ppmSuite. PReMiuM implements Dirichlet process mixture models (regression models linking the response to covariates through cluster membership) with missing covariate values.
imputeR also uses an EM based imputation framework that offers several different algorithms, including Lasso, tree-based models or PCA. In addition, TestDataImputation implements imputation based on EM estimation (and other simpler imputation methods) that is well suited for dichotomous and polytomous tests with item responses.
- Multiple imputation is performed using Maximum Likelihood Multiple Imputation in mlmi.
- Full Information Maximum Likelihood (also known as “direct maximum likelihood” or “raw maximum likelihood”) is available in lavaan (and in its extension semTools), OpenMx, rsem, and simsem for handling missing data in structural equation modeling.
- Bayesian approaches for handling missing values in model based clustering with variable selection are available in VarSelLCM. The package also provides imputation using the posterior mean.
- Missing values in generalized linear models can be handled with package mdmb for various families. JointAI implements Bayesian approaches for generalized linear mixed models.
- Handling of missing data in item response models (including Rasch models and extensions) is implemented in TAM, mirt, eRm, and ltm for univariate or multivariate responses. LNIRT also addresses these models but allows missing values to be specified as “missing-by-design”, and MLCIRTwithin includes latent-class models.
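To make the EM idea concrete, here is a minimal base R sketch for a bivariate Gaussian in which x is fully observed and y is partly missing. It is a didactic illustration only, not the algorithm used by norm or any other package listed above. The function name em_bivariate_normal and the simulated data are assumptions made for this example.

```r
# EM for (x, y) bivariate normal, with x complete and some y missing.
em_bivariate_normal <- function(x, y, max_iter = 500, tol = 1e-10) {
  stopifnot(!anyNA(x), length(x) == length(y))
  miss <- is.na(y)
  # x is complete, so its ML mean and variance are known from the start.
  mu_x <- mean(x)
  s_xx <- mean((x - mu_x)^2)
  # Starting values for the parameters involving y (complete cases, no covariance).
  mu_y <- mean(y[!miss])
  s_yy <- mean((y[!miss] - mu_y)^2)
  s_xy <- 0
  for (iter in seq_len(max_iter)) {
    # E-step: expected y and y^2 for missing entries, given x and current parameters.
    beta <- s_xy / s_xx
    cond_var <- s_yy - s_xy^2 / s_xx
    ey <- y
    ey[miss] <- mu_y + beta * (x[miss] - mu_x)
    ey2 <- ey^2
    ey2[miss] <- ey2[miss] + cond_var
    # M-step: ML estimates from the expected sufficient statistics.
    new_mu_y <- mean(ey)
    new_s_yy <- mean(ey2) - new_mu_y^2
    new_s_xy <- mean(x * ey) - mu_x * new_mu_y
    change <- max(abs(c(new_mu_y - mu_y, new_s_yy - s_yy, new_s_xy - s_xy)))
    mu_y <- new_mu_y
    s_yy <- new_s_yy
    s_xy <- new_s_xy
    if (change < tol) break
  }
  list(mu = c(x = mu_x, y = mu_y),
       sigma = matrix(c(s_xx, s_xy, s_xy, s_yy), 2, 2),
       iterations = iter)
}

# Simulated example: y is missing whenever x is large (a MAR mechanism).
set.seed(42)
n <- 200
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
y[x > 0.5] <- NA
fit <- em_bivariate_normal(x, y)
```

In this setting the complete-case mean of y is biased downward because large values of x (and hence of y) are more often missing. The EM estimate corrects for this by using the observed x values of the incomplete rows.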
Single imputation
- The simplest method for missing data imputation is imputation by mean (or median, mode, ...). This approach is available in many packages, including Hmisc, which contains various proposals for imputing all missing instances of a variable with the same value.
- Generic packages: The packages VIM and filling contain several popular methods for missing value imputation (including some listed in the sections dedicated to specific methods as listed below). In addition, simputation is a general package for imputation by any prediction method that can be combined with various regression methods, and works well with the tidyverse.
- k-nearest neighbors is a popular method for missing data imputation that is available in many packages including the main packages yaImpute (with many different methods for kNN imputation, including a CCA based imputation) and VIM. It is also available in impute (where it is oriented toward microarray imputation).
isotree uses a similar approach to impute missing values, which is based on similarities between samples and isolation forests.
- Hot-deck imputation is implemented in the package hot.deck, with various possible settings (including multiple imputation). It is also available in VIM (function hotdeck) and a fractional version (using weights) is provided in FHDI. StatMatch also uses hot-deck imputation to impute surveys from an external dataset.
Similarly, impimp uses the notion of a “donor” to impute a set of possible values, termed “imprecise imputation”.
- Imputation based on random forest is implemented in missForest with a faster version in missRanger.
- Other regression based imputations are implemented in VIM (linear regression based imputation in the function regressionImp). iai tunes optimal imputation based on kNN, tree or SVM.
- Matrix completion is implemented with iterative PCA/SVD-decomposition in the package missMDA for numerical, categorical and mixed data (including imputation of groups). NIPALS (also based on SVD computation) is implemented in the packages mixOmics (for PCA and PLS), ade4, nipals and plsRglm (for generalized model PLS). cmfrec is also a large package dedicated to matrix factorization (for recommender systems), which includes imputation. Other PCA/factor based imputations are available in pcaMethods (with a Bayesian implementation of PCA), in primePCA (for heterogeneous missingness in high-dimensional PCA) and tensorBF (for 3-way tensor data). Low rank based imputation is provided in softImpute, which contains several methods for iterative matrix completion. This method is also available in the very general package rsparse, which contains various tools for sparse matrices. Variants based on low rank assumptions are available in denoiseR, in mimi, in ECLRMC and CMF (for ensemble matrix completion), and in ROptSpace (with a computationally efficient approach).
- Imputation based on copula is implemented in CoImp with a semi-parametric imputation procedure and in mdgc using Gaussian copula for mixed data types.
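Two of the simplest ideas in this section, imputation by the mean and imputation from a nearest donor, can be sketched in base R as follows. The helper names impute_mean and impute_nn and the toy vectors are hypothetical. The packages listed above provide far more complete implementations.

```r
# Mean imputation: every missing value receives the mean of the observed values.
impute_mean <- function(v) {
  v[is.na(v)] <- mean(v, na.rm = TRUE)
  v
}

# Nearest-donor imputation (a 1-nearest-neighbour hot deck): each missing value is
# copied from the observed unit whose fully observed covariate is closest.
# Ties are resolved in favour of the first donor found.
impute_nn <- function(target, covariate) {
  stopifnot(!anyNA(covariate), length(target) == length(covariate))
  donors <- which(!is.na(target))
  for (i in which(is.na(target))) {
    nearest <- donors[which.min(abs(covariate[donors] - covariate[i]))]
    target[i] <- target[nearest]
  }
  target
}

x <- c(2, 4, NA, 8, NA, 6)   # variable to impute
z <- c(1, 2, 3, 4, 5, 6)     # fully observed covariate

x_mean <- impute_mean(x)
x_nn <- impute_nn(x, z)

# Mean imputation shrinks the variance compared with the observed values.
var_observed <- var(x, na.rm = TRUE)
var_mean_imputed <- var(x_mean)
```

The variance comparison illustrates why imputing a single constant value tends to understate variability, which is one motivation for the donor-based and model-based methods listed in this section.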
Multiple imputation
Some of the above mentioned packages can also handle multiple imputations.
- Amelia implements bootstrap multiple imputation using EM to estimate the parameters; for quantitative data, it imputes assuming a multivariate Gaussian distribution. In addition, AmeliaView is a GUI for Amelia, available from the Amelia web page.
NPBayesImputeCat also implements multiple imputation by joint modeling for categorical variables but using a Bayesian approach.
- mi, mice, and smcfcs implement multiple imputation by Chained Equations. Other packages are based on or extend mice, like miceFast, which provides an alternative implementation of mice imputation methods using object oriented style programming and C++, bootImpute, which performs bootstrap based imputations and analyses of these imputations, and miceRanger and CALIBERrfimpute, which both perform multiple imputation by chained equations using random forests.
- missMDA implements multiple imputation based on SVD methods.
- hot.deck implements hot-deck-based multiple imputation.
- rMIDAS implements multiple imputation based on denoising auto-encoders.
- Multilevel imputation: Multilevel multiple imputation is implemented in jomo, mice, miceadds, micemd, mitml, and pan.
- Qtools and miWQS implement multiple imputation based on quantile regression.
- lodi implements the imputation of observed values below the limit of detection (LOD) via censored likelihood multiple imputation (CLMI).
In addition, mitools provides a generic approach to handle multiple imputation in combination with any imputation method, NADIA provides a uniform interface to compare the performance of several imputation algorithms, cobalt computes balance tables and plots for multiply imputed datasets, and SynthTools provides confidence intervals for multiply imputed datasets.
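Whatever method produces the imputed datasets, the analyses are usually combined with Rubin's rules. The base R sketch below shows the standard formulas for a scalar parameter. The function name pool_rubin and the input numbers are made up for illustration and do not come from any package listed above.

```r
# Pool m point estimates and their squared standard errors with Rubin's rules.
pool_rubin <- function(estimates, variances) {
  m <- length(estimates)
  stopifnot(m > 1, length(variances) == m)
  q_bar <- mean(estimates)            # pooled point estimate
  u_bar <- mean(variances)            # within-imputation variance
  b <- var(estimates)                 # between-imputation variance
  total <- u_bar + (1 + 1 / m) * b    # total variance
  c(estimate = q_bar, within = u_bar, between = b,
    total = total, se = sqrt(total))
}

# Hypothetical results from m = 5 imputed datasets.
pooled <- pool_rubin(estimates = c(1.0, 1.2, 0.8, 1.1, 0.9),
                     variances = rep(0.04, 5))
```

The between-imputation term is what distinguishes multiple imputation from single imputation: it carries the uncertainty due to the missing values into the final standard error.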
Weighting methods
- Computation of weights for observed data to account for unobserved data by Inverse Probability Weighting (IPW) is implemented in ipw. IPW is also used for quantile estimations and boxplots in IPWboxplot.
- Doubly Robust Inverse Probability Weighted Augmented GEE Estimator with missing outcome is implemented in CRTgeeDR.
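The principle of inverse probability weighting can be sketched in base R with a logistic model for the probability of being observed. This is a toy illustration on made-up data and does not reproduce the interface of ipw or of the other packages above.

```r
# Made-up data: y is missing more often in group "a" than in group "b".
dat <- data.frame(
  group = factor(rep(c("a", "b"), each = 4)),
  y = c(1, 3, NA, NA, 10, 12, 14, NA)
)
dat$observed <- as.integer(!is.na(dat$y))

# Model the probability that y is observed, given the fully observed covariate.
ps_fit <- glm(observed ~ group, family = binomial, data = dat)
p_obs <- fitted(ps_fit)

# Each observed unit is weighted by the inverse of its probability of being observed.
w <- 1 / p_obs
obs <- dat$observed == 1
ipw_mean <- sum(w[obs] * dat$y[obs]) / sum(w[obs])

# Unweighted complete-case mean, for comparison.
cc_mean <- mean(dat$y, na.rm = TRUE)
```

Observed units from group "a" receive a larger weight because half of that group is missing, so the weighted mean gives both groups equal influence, whereas the complete-case mean over-represents group "b".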
Specific types of data
- Longitudinal data / time series data: Imputation for time series is implemented in imputeTS. Other packages, such as forecast, spacetime, timeSeries, xts, prophet, stlplus, or zoo, are dedicated to time series but also contain some (often basic) methods to handle missing data (see also TimeSeries). Based on tidy principles, padr and tsibble also provide methods for imputing missing values in time series. Similarly, DTSg offers basic functionality for missing value description and imputation in time series based on the fast
More specific methods are implemented in other packages: imputation of time series based on Dynamic Time Warping is implemented in the family of packages DTWBI, DTWUMI, and FSMUMI for univariate and multivariate time series. BMTAR provides an estimation of the autoregressive threshold models with Gaussian noise using a Bayesian approach in the presence of missing data in multivariate time series. swgee implements an IPW approach for longitudinal data with missing observations. tsrobprep implements imputation of missing values using a robust decomposition of the time series.
For more specific time series, cold fits longitudinal count models from data with missing values.
- Spatial data: Imputation for spatial data is implemented in the package rtop, which performs geostatistical interpolation of irregular areal data, and in areal, which performs areal weighted interpolation using tidyverse data management.
Interpolation of spatial data based on genetic distances is also available in phylin.
- Spatio-temporal data (see also SpatioTemporal): Imputation for spatio-temporal data is implemented in the package StempCens with a SAEM approach that approximates EM when the E-step does not have an analytic form.
From an application perspective, gapfill is dedicated to the imputation of satellite data observed at equally-spaced points in time and foster to the imputation of satellite data based on observed predictors. momentuHMM is dedicated to the analysis of telemetry data using generalized hidden Markov models (including multiple imputation for missing data).
- Survival data: Multiple imputation for the estimation of cumulative incidence functions is implemented in kmi.
- Distance matrices: Imputation for Euclidean distance matrix is implemented in edmcr, using different optimization approaches.
- Graphs/networks: missSBM imputes missing edges in Stochastic Block models, cglasso implements an extension of the Graphical Lasso inference from censored and missing value measurements, and bnstruct provides an extension of various methods for Bayesian network inference from data with missing values. Oriented toward inference of species community networks, eicm uses an extension of binomial GLM that handles missing values.
- Imputation for contingency tables is implemented in lori that can also be used for the analysis of contingency tables with missing data.
- Imputation for compositional data (CODA) is implemented in robCompositions and zCompositions (various imputation methods for zeros, left-censored and missing data).
- Rank models with partially missing rankings are handled in BayesMallows with Bayesian methods, and in irrNA to compute inter-rater reliability and concordance.
- Experimental design: experiment handles missing values in experimental design such as randomized experiments with missing covariate and outcome data, and matched-pairs design with missing outcome.
- Recurrent events: dejaVu performs multiple imputation of recurrent event data based on a negative binomial regression model.
Specific tasks
- Regression and classification: many different supervised methods can accommodate the presence of missing values. randomForest, grf, and StratifiedRF handle missing values in predictors in various random forest based methods. misaem handles missing data in linear and logistic regression and allows for model selection. psfmi also provides a framework for model selection for various linear models in multiply imputed datasets. naivebayes provides an efficient implementation of the naive Bayes classifier in the presence of missing data. plsRbeta implements PLS for beta regression models with missing data in the predictors. lqr provides quantile regression estimates based on various distributions in the presence of missing values and censored data. eigenmodel handles missing values in regression models for symmetric relational data.
- Clustering: biclustermd handles missing data in biclustering. RMixtComp, MGMM, and mixture fit various mixture models in the presence of missing data. ClustImpute deals with missing values in k-means clustering.
- Tests for two-sample paired missing data are implemented in robustrank and MKinfer; the latter is based on multiply imputed datasets. Reliability of tests for data with missing values is assessed with a Bayesian approach in brxx.
- Meta-analysis: metavcov offers a collection of functions, including multiple imputations for missing data, for multivariate meta-analyses. More specifically, imputation for meta-analyses of binary outcomes is provided in metasens and NMADiagT provides a Bayesian analysis using network meta-analysis of dose response studies in which MNAR missing values are accounted for.
- Outlier detection (and robust analysis) in the presence of missing values is implemented in GSE and rrcovNA.
- ROC estimation in the presence of missing values is available in bcROCsurface for ROC surface and in BLOQ for left censored data.
Specific application fields
- Genetics: Imputation of SNP data is implemented in alleHap (using solely deterministic techniques based on pedigree data), in QTLRel (using information on flanking SNPs), in snpStats (using a nearest neighbor approach), and in HardyWeinberg (using multiple imputations with a multinomial model based on allele intensities and/or flanking SNPs).
qgtools includes linear mixed models and resampling techniques for quantitative genetics analyses in the presence of missing data. The EM algorithm is used to compute genetic statistics for populations in the presence of missing SNPs in StAMPP. SCAT (archived) implements a conditional association test that adjusts for heterogeneity in SNP coverage and thus for missing data in SNP values.
Finally, FILEST is used to simulate SNP datasets with outlying individuals and missing values.
- Phylogeny: Imputation of missing data for phylogeny is implemented in Rphylopars with different evolutionary models. Simulation of incomplete phylogeny can be performed with TreeSim.
- Genomics: Imputation for dropout events (i.e., under-sampling of mRNA molecules) in single-cell RNA-Sequencing data is implemented in DrImpute, Rmagic, SAVER, and iCellR, and is based, respectively, on clustering of cells, Markov affinity graph, an empirical Bayes approach, and k-nearest neighbors. The first three packages are used and combined in scRecover and ADImpute and the last one can also handle other types of single-cell data, such as scATAC-Seq or CITE-Seq.
RNAseqNet uses hot-deck imputation to improve RNA-seq network inference with an auxiliary dataset.
- Chemometrics: Various functions to analyze the missing value mechanism and to impute missing values (using multiple imputation) in LC-MS/MS spectra are available in imp4p for protein quantification. More specifically, wrProteo provides multiple replacement of missing values by low random values in quantitative proteomics data.
Imputation of data under detection limit for NIR spectra is provided in NIRStat for standard analyses of NIR time series.
- Epidemiology: bayesCT implements various methods for simulation and analysis of clinical trials in a Bayesian framework that allows for handling and imputation of missing data. sanon implements a method for analysis of randomized clinical trials with strata that can handle MCAR data.
More specifically, idem implements a procedure for comparing treatments in clinical trials with missed visits or premature withdrawal. InformativeCensoring implements multiple imputation for informative censoring. pseval evaluates principal surrogates in a single clinical trial in the presence of missing counterfactual surrogate responses. sievePH implements continuous, possibly multivariate, mark-specific hazard ratio with missing values in multivariate marks using an IPW approach. icenReg performs imputation for censored responses for interval data.
- Health: missingHE implements models for health economic evaluations with missing outcome data. accelmissing provides multiple imputation with the zero-inflated Poisson lognormal model for missing count values in accelerometer data.
- Morphometry: LOST can be used to simulate missing morphometric data randomly, with taxonomic bias and with anatomical biases.
- Environment: AeRobiology imputes missing data in aerobiological datasets imported from aerobiological public databases. QUALYPSO can handle missing data and provides unbiased estimates of climate change responses for incomplete ensembles of climate projections.
- Causal inference: Various methods for causal inference with missing data are implemented in targeted, using augmented IPW estimators. Causal inference with interactive fixed-effect models is available in gsynth, with missing values handled by matrix completion, and in dosearch, via extension of do-calculus to missing data. MatchThem matches multiply imputed datasets using several matching methods, and provides users with tools to estimate causal effects in each imputed dataset. grf offers treatment effect estimation with incomplete confounders and covariates under modified unconfoundedness assumptions and RCAL implements regularized calibrated estimation for causal inference with missing values and high dimension.
- Finance: imputeFin handles imputation of missing values in financial time series using AR models or random walk.
- Scoring: Basic methods (mean, median, mode, ...) for imputing missing data in scoring datasets are proposed in scorecardModelUtils.
- Preference models: Missing data in preference models are handled with a composite link approach that allows for MCAR and MNAR patterns to be accounted for in prefmod.
- Administrative records / Surveys: BIFIEsurvey is a very general package that contains tools for survey statistics and that can handle multiply imputed datasets. More specifically, fastLink provides a Fellegi-Sunter probabilistic record linkage that allows for missing data and the inclusion of auxiliary information.
convergEU can process Eurostat data and impute missing values to monitor convergence between EU countries. eechidna has similar features for Australian election and public census datasets.
- Bibliometry: robustrao computes the Rao-Stirling diversity index (a well-established bibliometric indicator to measure the interdisciplinarity of scientific publications) with data containing uncategorized references. metagear provides hot-deck imputation in bibliographic data for systematic reviews and meta-analysis.
|Core:||Amelia, hot.deck, imputeTS, jomo, mice, missMDA, naniar, softImpute, VIM, yaImpute.|
|Regular:||accelmissing, ade4, AeRobiology, alleHap, areal, bayesCT, BayesMallows, bcROCsurface, biclustermd, BIFIEsurvey, BLOQ, BMTAR, bnstruct, bootImpute, brxx, CALIBERrfimpute, cat, cglasso, ClustImpute, CMF, cmfrec, cobalt, CoImp, cold, convergEU, CRTgeeDR, dejaVu, denoiseR, DescTools, dlookr, dosearch, DrImpute, DTSg, DTWBI, DTWUMI, ECLRMC, edmcr, eechidna, eicm, eigenmodel, eRm, experiment, fastLink, FHDI, FILEST, filling, forecast, foster, FSMUMI, gapfill, grf, GSE, gsynth, HardyWeinberg, Hmisc, iai, iCellR, icenReg, idem, ie2misc, imp4p, impimp, imputeFin, imputeMulti, imputeR, imputeTestbench, InformativeCensoring, ipw, IPWboxplot, irrNA, Iscores, isni, isotree, JointAI, kmi, lavaan, LNIRT, lodi, lori, LOST, lqr, ltm, MatchThem, mde, mdgc, mdmb, memisc, metagear, metasens, metavcov, MGMM, mi, miceadds, miceFast, micemd, miceRanger, mimi, mirt, misaem, missCompare, missForest, missingHE, missMethods, missRanger, missSBM, misty, mitml, mitools, miWQS, mix, mixture, MKinfer, MLCIRTwithin, mlmi, MMDai, momentuHMM, monomvn, NADIA, naivebayes, nipals, NIRStat, NMADiagT, norm, norm2, NPBayesImputeCat, OpenMx, padr, pan, phylin, plsRbeta, plsRglm, ppmSuite, prefmod, PReMiuM, primePCA, prophet, pseval, psfmi, qgtools, QTLRel, Qtools, QUALYPSO, randomForest, RBtest, RCAL, retroharmonize, rMIDAS, RMixtComp, RMixtCompIO, RMixtCompUtilities, RNAseqNet, robCompositions, robustrank, robustrao, ROptSpace, Rphylopars, rrcovNA, rsem, rsparse, rtop, samon, sanon, SAVER, scorecardModelUtils, semTools, sievePH, simFrame, simglm, simputation, simsem, sjlabelled, sjmisc, smcfcs, spacetime, StAMPP, StatMatch, StempCens, stlplus, StratifiedRF, swgee, SynthTools, TAM, targeted, tensorBF, TestDataImputation, tidyr, timeSeries, TreeSim, tsibble, tsrobprep, VarSelLCM, wrangle, wrProteo, xts, zCompositions, zoo.|