On the contrary, others argue that imputation risks adding artificial relationships to the data. However, it is generally accepted in industry that if the proportion of missing data is small (<10%), the risk of introducing bias through imputation is minimal. Quick fixes such as replacing missing values with the mean or median are convenient, but they are more likely to introduce bias. For example, imputing with the mean leaves the mean unchanged but reduces the variance, which may be undesirable.
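To make the point about variance concrete, here is a minimal sketch using only the Python standard library. The sample values are made up purely for illustration; the point is only that filling every gap with the observed mean keeps the mean where it was while shrinking the sample variance.

```python
import statistics

# Hypothetical toy sample; None marks a missing value.
values = [2.0, 4.0, None, 8.0, 10.0, None, 6.0]

observed = [v for v in values if v is not None]
mean_observed = statistics.mean(observed)

# Mean imputation: every gap receives the same value, the observed mean.
imputed = [mean_observed if v is None else v for v in values]

print("mean before:", mean_observed, "after:", statistics.mean(imputed))
print("variance before:", statistics.variance(observed),
      "after:", statistics.variance(imputed))
```

The imputed points sit exactly on the mean, so they add nothing to the sum of squared deviations while increasing the sample size, which is why the variance drops.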
Below I demonstrate how to impute using multivariate imputation by chained equations (MICE) with the ‘mice’ package in R and the ‘impyute’ package in Python. There is another Python package, 'fancyimpute', that I began using initially, but installation is not straightforward, and whilst it is capable of imputing using a variety of algorithms such as mice, knn, iterativeSVD, etc., it does have performance issues. Prior to data imputation it is important to establish whether the data is missing completely at random (MCAR), missing at random (MAR) or missing not at random (MNAR) (see here to identify the type of missing data you have).
Data imputation in R with the 'mice' package
The code snippet below loads the iris data set and converts 10% of its values to missing values. We then visualise the missing values to establish whether there are any patterns.
As the right-hand plot illustrates, 67% of the rows in the data set have no missing values. The histogram on the left shows ~12% missing values in Petal.Length, ~11% in Species, ~10% in Sepal.Width and so forth. Now that we have established that the proportion of missing values is adequately low and shows no pattern, we are ready to impute.
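For readers who prefer Python, the same two checks (share of missing values per column, and share of rows with no missing values) can be sketched with the standard library alone. This is not the R code used above: the data is a randomly generated stand-in, and the column names and seed are arbitrary assumptions.

```python
import random

random.seed(0)  # arbitrary seed so the sketch is repeatable

# Hypothetical stand-in data: 150 rows, 4 numeric columns.
columns = ["a", "b", "c", "d"]
rows = [[random.gauss(0, 1) for _ in columns] for _ in range(150)]

# Blank out exactly 10% of all cells, chosen uniformly at random (MCAR).
cells = [(r, c) for r in range(len(rows)) for c in range(len(columns))]
for r, c in random.sample(cells, k=len(cells) // 10):
    rows[r][c] = None

# Share of missing values in each column.
for c, name in enumerate(columns):
    share = sum(row[c] is None for row in rows) / len(rows)
    print(f"{name}: {share:.1%} missing")

# Share of rows with no missing value at all.
complete = sum(all(v is not None for v in row) for row in rows) / len(rows)
print(f"complete rows: {complete:.1%}")
```

Note that even when only 10% of the cells are missing, the share of fully complete rows is much lower, because a row is incomplete as soon as any one of its cells is missing.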
The code snippet below shows data imputation with mice. The parameter m is the number of imputed data sets to create and maxit is the number of iterations. The effect of these parameters is clear in the live output generated in the R console when the code is run, as shown below. There are several methods to choose from: predictive mean matching (pmm), logistic regression (logreg), polytomous logistic regression (polyreg) and the proportional odds model (polr).
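To give a feel for what the iterations controlled by maxit are doing, here is a deliberately simplified Python sketch of a chained-equations loop. It is not the mice implementation: it handles only two numeric columns, uses plain least-squares regression instead of pmm, and is fully deterministic, whereas mice draws imputations at random and repeats the whole procedure m times to produce m data sets. The function names fit_line and chained_impute and the toy data are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for y = a + b * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def chained_impute(data, maxit=10):
    """Toy chained-equations loop for rows of [x, y]; None marks a gap."""
    missing = [[v is None for v in row] for row in data]
    filled = [list(row) for row in data]

    # Start by filling each column's gaps with that column's observed mean.
    for col in (0, 1):
        observed = [row[col] for row in data if row[col] is not None]
        col_mean = sum(observed) / len(observed)
        for row in filled:
            if row[col] is None:
                row[col] = col_mean

    # Each pass revisits every column: fit on rows where the column was
    # observed, then overwrite its gaps with predictions from the other column.
    for _ in range(maxit):
        for col in (0, 1):
            other = 1 - col
            train = [r for r, m in zip(filled, missing) if not m[col]]
            intercept, slope = fit_line([r[other] for r in train],
                                        [r[col] for r in train])
            for r, m in zip(filled, missing):
                if m[col]:
                    r[col] = intercept + slope * r[other]
    return filled


# Hypothetical data lying exactly on y = 2x + 1, with two cells removed.
data = [[float(i), 2.0 * i + 1.0] for i in range(10)]
data[3][1] = None   # y missing where x = 3 (true value 7)
data[7][0] = None   # x missing where y = 15 (true value 7)

result = chained_impute(data, maxit=10)
print(result[3], result[7])
```

The first pass starts from crude mean fills, and each further pass refines one column's imputations using the current state of the other, which is why more iterations generally bring the imputed values closer to a stable answer.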
Now that we have 5 versions of our imputed data set, we can save a complete data set, as shown below, where we have opted to use the second data set generated. You can also build models on all 5 data sets using the with() command and combine the results from these models using the pool() command.
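Pooling in mice follows Rubin's rules. The core arithmetic can be sketched in a few lines of Python. This is only an illustration of the combining step, not the package's code (it leaves out the degrees-of-freedom calculation, for instance), and the estimates and squared standard errors below are invented numbers standing in for the output of five fitted models.

```python
import statistics


def pool_rubin(estimates, variances):
    """Combine one estimate and one squared standard error per imputed data set."""
    m = len(estimates)
    pooled = statistics.mean(estimates)        # pooled point estimate
    within = statistics.mean(variances)        # average sampling variance
    between = statistics.variance(estimates)   # spread across the imputations
    total = within + (1 + 1 / m) * between     # total variance of the pooled estimate
    return pooled, total


# Hypothetical coefficient estimates and squared standard errors from m = 5 fits.
estimates = [0.52, 0.48, 0.50, 0.55, 0.45]
variances = [0.010, 0.012, 0.011, 0.009, 0.013]

pooled, total_var = pool_rubin(estimates, variances)
print(f"pooled estimate: {pooled:.3f}, standard error: {total_var ** 0.5:.3f}")
```

The between-imputation term is what a single imputed data set cannot give you: it reflects the extra uncertainty that comes from not knowing the missing values.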
# check imputed values
imputed_Data$imp$Sepal.Width

# Since there are 5 imputed data sets, you can select any using the complete() function.
# get complete data (2nd out of 5)
completeData <- complete(imputed_Data, 2)
summary(completeData)
Data imputation in Python with 'impyute'
Now that we have seen how to do data imputation in R, let's take a look at how to do imputation in Python. The impyute package is easy to install and use; see the link for more information.
The following code snippet uses publicly available road traffic accident data to show how easy data imputation is in Python with the impyute package.