Apr 18, 2018

Data imputation in R and Python

0 comments

 

Missing data is problematic for machine learning models. The majority of models do not run when missing values are present. Although there are a few algorithms (e.g. k-nearest neighbours) that can handle the presence of missing values, when your data is limited in size and you want evaluate the performance of different algorithms, removing rows with missing values is chucking away valuable information.

 

On the contrary others argue that there is a danger of adding artificial relationships into the data when it is imputed. However, it generally accepted in industry if the proportion of missing data is small (<10%) then the risk of introducing bias through imputation is minimal. Quick fixes such as replacing with missing values with the mean or median are often convenient they are more likely to introduce bias. For example, imputation with the mean is likely to not change the mean but reduce the variance, which may be undesirable.

 

Below I demonstrate how to impute using the multivariate imputation via chained equations (MICE) with the ‘mice’ package in R and the ‘impyute’ package in python. There is another python package, 'fancy impute' that I began using initially, but installation is not straightforward, and whilst it capable of imputing using a variety of algorithms such as mice, knn, iterativeSVD, etc it does have performance issues. Prior to data imputation it is important to establish whether the data is missing completely at random (MCAR), missing at random (MAR) or missing not at random (MNAR) (see here to identify the type of missing data you have).

 

Data imputation in R with the 'mice' package

The code snippet below shows the loading of the iris data set and 10% of values converted to missing values. Thereafter we visualise the missing values to establish if there are any patterns.

library(mice)
data <- iris
summary(iris)
#Generate 10% missing values at Random 
library(missForest)
iris.mis <- prodNA(iris, noNA = 0.1)
md.pattern(iris.mis)

#visualise missing values
install.packages("VIM")
library(VIM)
mice_plot <- aggr(iris.mis, col=c('navyblue','yellow'),
                    numbers=TRUE, sortVars=TRUE,
                    labels=names(iris.mis), cex.axis=.7,
                    gap=3, ylab=c("Missing data","Pattern"))

 

 

As illustrated on the right plot there is 67% of values in the data set with no missing values. The histogram on the left shows there is ~12% missing values in Petal.Length, ~11% missing values in Species, ~10% missing values in Sepal.Width and so forth. Now we have established that there are adequately low number of missing values with no pattern, we are ready to impute.

 

The code snippet below shows data imputation with mice. The parameter m refers to the number of imputed data sets to create and maxit refers to the number of iterations. The effects of these parameters are clear in the live output generated in the R console when the code is run, as shown below. There are several methods to choose from: predictive mean matching (pmm), logistic regression(logreg), Bayesian polytomous regression (polyreg) and proportional odds model.

 

imputed_Data <- mice(iris.mis, m=5, maxit = 50, method = 'pmm', seed = 500)
summary(imputed_Data)

 

 

Now we have 5 versions of our imputed data set we can save our complete data set, as shown below, where we have opted to use the second data set generated. You can also build models on all 5 data sets using the with() command, and also combine the results from these models using the pool() command.

#check imputed values
imputed_Data$imp$Sepal.Width

#Since there are 5 imputed data sets, you can select any using complete() function.
#get complete data ( 2nd out of 5)
completeData <- complete(imputed_Data,2)
summary(completeData)

 

Data imputation in Python with 'impyute'

Now we have seen how to do data imputation in R, lets take a look at how to do imputation in python. The impyute package is easy to install and use, see the link for more information.

 

The following code snippet uses publicly available road traffic accident data to show how easy data imputation is in python with the impyute package.

 

import pandas as pd
import numpy as np
from impyute import *

#ReadData
RTA = pd.read_csv("C:\\Users\\darsh\\OneDrive - Elastacloud Limited\\Meetups\\IWDS\\session 8- EDA\\Data\\Accidents2016.csv")

#ExtractNumericColumns
RTA_num = RTA.iloc[:,[1,2,3,4,5,7,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24]]
RTA_num.info()

RTA_num.isnull().sum()

#ImputeData
mice_fill_RTA = pd.DataFrame(MICE().complete(RTA_num),columns = RTA_num.columns)
mice_fill_RTA.isnull().sum()

Like R, python also gives you a live output as performing the imputations, as shown below.

The figure below shows the number of missing values per column before and after imputation.

As we have shown we don't have to always lose valuable data as imputations can be easily done in R and Python.

Happy imputing!

New Posts
  • Microsoft have recently released an updated version of their Azure Machine learning service. At Elastacloud we have been using AML since the first release to deploy machine learning models to the cloud. AML provides a platform to develop, train, test, deploy, manage, and track machine learning models but it is mostly the deploy and manage part that Elastacloud have made use of, so my article is going to focus on these aspects. Based on feedback from the community Microsoft have made sweeping changes to the service which essentially mean it is a new product. Some of the major changes that users will have to adjust to are: - No workbench Don’t need individual experimentation and modelmanagement accounts, just a single workspace New Python SDKs New Azure Machine Learning CLI extension My view on the removal of the workbench is neutral, I never used it previously other than to launch the CLI. My understanding is that it was an underused tool, with most people preferring to develop their models in an IDE such as VS Code or even in Jupyter Notebooks. There are two new Python SDKs; the Machine learning SDK and the Data prep SDK. The ML SDK, in Microsoft’s words, “is used by data scientists and AI developers to build and run machine learning workflows upon the Azure Machine Learning service”. In a recent Elastacloud project we have deployed a number of predictive machine learning models, as a web service, for a customer using the new AML service. I used the Machine learning SDK to complete this task and after just a few teething problems I found that the SDK was easy to work with and certainly felt like a tool that made me more productive. One of the requirements for this web service required some consideration on how to best create the service; different models should be loaded and used to generate the predictions based on the day of the week. This requirement arises because our customer always needs forecasts for the next two days and the next two business days, meaning that on a Friday, for example, they want forecasts for the next two days (Saturday and Sunday) and the next two business days (Monday and Tuesday) whereas on a Monday they only need them for the next two business days (Tuesday and Wednesday). Therefore, we have three different models (one day ahead, two days ahead, weekend) deployed to the same service, with multiple dependencies (e.g. *.py files). Azure ML made the creation of the service very easy, as demonstrated in the code examples below. Creating a Docker image Models are deployed in Docker images to Azure Kubernetes Service (AKS) or Azure Container Instances (ACI). The code excerpt below shows how simple it is to create this image in only two lines; image_config contains the required score.py file alongside the optional dependencies and a conda .env file. The image is then built with ContainerImage.create where the already registered models are provided. Deploying the service Once we have a successfully built Docker image we can deploy it to AKS or ACI as a web service. This, again, is very easy to do with the SDK as shown in the image below. We only need to define a configuration, with AksWebservice.deploy_configuration() (gives default configuration), then use the deploy_from_image method, providing our Docker image as one of the arguments. And so we have successfully deployed our machine learning models as a web service! Now we can get the scoring URI ( aks_service.scoring_uri ) and the access keys ( aks_service.get_keys ) and start making requests to our machine learning models.
  • Extreme Gradient Boosting (xgboost) is a very fast, scalable implementation of gradient boosting that has taken the data science world by storm, with xgboost regularly online data science competitions and use at scale across different industries. Xgboost was originally developed by Tiangi Chen and is renowned for execution speed and model performance. I have recently been conducting some experiments with xgboost for the Renewables AI products. Below I show how to run a simple regression type tree and linear based model. Thereafter we go on to explore grid search and random search with xgboost. But first a little background information… Boosting is what gives xgboost it’s state of the art performance. Boosting is not a specific machine learning algorithm, but a concept that can be applied to set of machine learning algorithms, hence boosting is known as a meta algorithm. Essentially, xgboost is an ensemble method, used to convert many weak learners (models performing slightly better than chance) to a strong learner. This is achieved via boosting, where a set of weak learners on subsets of the data is iteratively learnt. Each weak learner is weighted according to performance. Thereafter, each weak learner’s predictions are combined and multiplied by their weight to obtain a final weighted prediction, which is better than any of the individual predictions themselves. The Python API is capable of running the xgboost on regression and classification problems, using decision tree and linear learners. Below we apply xgboost to regression type problem using a tree-based learner. Decision trees are an iterative contruction of binary decisions (one decision at a time) until a stopping criterion is met (ie. The majority of one decision split consists of one category/value or another). Individual trees tend to overfit (low bias, high variance), hence perform well on training data but don’t generalise as well, hence ensemble methods are useful in this scenario. Notice as this is a regression type problem we use the loss function "reg:linear", whereas for a classification problem we would use "reg:logistic" or "binary: logistic" depending on whether you are interested in the class or the probability of the class. A loss function maps the difference between the actual and predicted values - we aim to find the model with the lowest loss function. import numpy as np import pandas as pd from sklearn.metrics import r2_score import xgboost as xgb from sklearn.metrics import mean_squared_error from xgboost import plot_tree X_train = pd.read_csv("X_train.csv") Y_train = pd.read_csv("Y_train.csv") X_test = pd.read_csv("X_test.csv") Y_test = pd.read_csv("\Y_test.csv") list(X_train) list(X_test) #################################### # XGBoost Decision Tree ################################### xg_reg = xgb.XGBRegressor(objective='reg:linear', n_estimators=10, seed= 123) xg_reg.fit(X_train,Y_train) #XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1, # colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0, # max_depth=3, min_child_weight=1, missing=None, n_estimators=10, # n_jobs=1, nthread=None, objective='reg:linear', random_state=0, # reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=123, # silent=True, subsample=1) preds = xg_reg.predict(X_test) rmse = np.sqrt(mean_squared_error(Y_test, preds)) print("RMSE: %f" % (rmse))#RMSE: 164.866642 r2 = r2_score(Y_test, preds) # Plot the first tree xgb.plot_tree(xg_reg,num_trees=0) plt.show() In the following code we apply xgboost to a regression type problem using linear based learners. ##################################### # XGBoost Linear Regression ##################################### DM_train = xgb.DMatrix(data=X_train, label=Y_train) DM_test = xgb.DMatrix(data=X_test, label=Y_test) params = {"booster":"gblinear", "objective":"reg:linear"} xg_reg = xgb.train(params = params, dtrain=DM_train, num_boost_round=10) preds = xg_reg.predict(DM_test) rmse = np.sqrt(mean_squared_error(Y_test, preds)) print("RMSE: %f" % (rmse)) #RMSE: 169.731848 r2 = r2_score(Y_test, preds) Like many other algorithms, performance can be enhanced by tuning the hyperparameters. Below shows an example of a xgboost grid search. Grid search can be quite computationally expensive as we exhaustively search over a given set of hyperparameters, and pick the best performing hyperparameters. For example, if we have 2 hyperparameters to tune and 4 possible values for each parameter, that’s 16 possible parameter configurations. An alterative to grid search is random search, where you can define how many models/iterations to try before stopping. During each iteration, the algorithm randomly selects a value in the range specified for each hyperparameter. import pandas as pd import xgboost as xgb import numpy as np from sklearn.model_selection import GridSeachCV housing_data = pd.read_csv("ames_housing_trimmed_processed.csv") X, y = housing_data[housing_data.columns_tolist()[:-1]], housing_data[housing_data.columns.tolist()[-1]] housing_dmatrix = xgb.DMatrix(data=X, label=y) gbm_param_grid = {'learning_rate':[0.01,0.1,0.5,0.9], 'n_estimators': [200], 'subsample': [0.3,0.5,0.9]} gbm = xgb.Regressor() grid_mse = GridSearchCV(estimator = gbm, param_grid = gbm_param_grid, scoring = 'neg_mean_squared_error', cv = 4, verbose = 1) grid_mse.fit(X,y) print("Best parameters found: ", grid_mse.best_params_) print("lowest RMSE: ", np.sqrt(np.abs(grid_mse.best_score_))) A quick overview of the hyperparameters that can be tuned for tree based models: - eta/learning rate (how quickly the model fits residual error using additional base lase learners -gamma; minimum loss reduction to create new tree split -lambda: L2 regularisation on lead weights -alpha: L1 regularisation on leaf weights -max depth: how big a tree can grow -subsample: Percentage of sample that can be used for any given boosting round -colsample_tree: the fraction of features that can be called on during any boosting round (ranges from 0-1) An overview of hyperparameters that can be tuned for linear learners: -lambda: L2 regularisation on weights -alpha: L1 regularisation on weights -lambda_bias: L2 regularisation term on bias Another useful blog posts related to xgboost can be found here . Happy experimenting!
  • I have had the opportunity to speak at various DataScience MeetUp events in Nottingham, Loughborough and London. Typically, rather than use the usual PowerPoint presentation, I prefer running LIVE codes as this gives the audience the confidence that my codes work and is repeatable. The challenge with this for me however is that I tend to do most of my preparation at the office and end up using my personal laptop for the eventual presentation. This means that I run a 'office computer' - github repo - personal computer back and forth triangle. Not until a colleague in the office introduced me to azure notebooks (thanks Darren!). Microsoft Azure Notebooks is a free service that provides Jupyter Notebooks along with supporting packages for R, Python and F#. Using this notebooks is easy! All one needs is a free account at www.notebooks.azure.com. Azure notebooks uses libraries for grouping notebooks. For example, I now have a library for my MeetUp events based on location Once you are signed in and you've created a new library by clicking the '+New Library' button, you can use the '+New' button in the library environment to create a notebook by clicking the 'Item type' drop box. At the moment, it supports Python(2.7, 3.5, 3.6), R and F#. If you are fussy about organisation, it also allows you to create a folder instead and then create your files. In addition to creating new files, you can also load files from a URL or from your computer. That's not all! The notebook file on azure has a cool slide presentation feature called the RISE Slideshow. This RISE Slideshow is a notebook extension which allows you to use it for presentations. To enable Slide mode: In your notebook click View/Cell Toolbar/Slide Show For each cell select its type and hierarchy on the right hand side To start the presentation, click the "graph" icon (shown above) on the main toolbar.  Use left/right/up/down to navigate slides. This automatically turns your notebook into a slide presentation! The coolest thing about this azure notebooks is it's all on the cloud. This means that you can present using any computer/laptop (even if there's no python installed). And because of this, you do not need to pip install any library on the local machine. It's all done on the cloud Note that your packages will only be available for the lifetime of your notebook server and your notebook server will typically shutdown after 1 hour of inactivity. Azure Notebooks also lets you auto-setup your environment if you have a pip requirements file. There's so much to it and I'll just let you explore! I hope this is as useful for you as it was (and still is) for me. Let me know if you've used this before and what you like or do not like about it. For suggestions on other platforms especially for beginners to use for Python/R, see this post by Laura Da Silva