Aug 2, 2017

AutoExploreR: An automated data exploration R package

Edited: Aug 2, 2017

 

Many recent posts in the data science media have emphasised the importance of the modern data scientist, who not only knows about statistics, machine learning and programming with R and Python, but also knows about cloud technology, parallelising code and software development for designing better tools to efficiently automate parts of the data science life cycle. At Elastacloud, innovating new tools, productionising data science and evaluating its impact are processes we aim to incorporate in our projects. Our latest development, the “AutoExploreR” package, illustrates just this.

 

We follow the CRISP-DM data science life cycle. As shown below, the cycle consists of several stages: business understanding, data understanding, data preparation or cleaning, modelling, evaluation and deployment. In our experience, data understanding and data cleaning are often the most time-consuming parts of the life cycle. The functions that are useful for data understanding are typically scattered across different packages, and visualisations require excessive effort to make presentable in reports.

 

 

The AutoExploreR package was developed with the hope of partly automating the data understanding stage of the life cycle. Currently the package contains several key functions which allow the user to gain rapid insight into their data, and to identify which variables may be of use for modelling and what types of cleaning and feature engineering processes may be required. A brief overview of the functionality is provided in the table below.

 

 

Most importantly, the package also auto-generates reports with all the essential outputs and interactive plots. This reduces both laborious documentation effort and data understanding time: a win-win, with more time for the fun stuff!

 

 

New Posts
AutoExploreR is an open source R package that can be used during the data exploration/understanding stage of the data science life cycle. At Elastacloud we have found this part of the process to be particularly time-consuming. When using R, a problem that we often encounter is that to do seemingly simple things, we either have to find and install multiple packages or write our own new functions as we go. With that in mind, we decided to create our own package that could carry out our most common tasks in a simple but reliable way.

This post will introduce just some of the functionality included in AutoExploreR and show some examples of its use on a real data set. The tasks that we will showcase include:

  • Calculating and visualising correlations
  • Identifying outliers

The data used is the 'swiss' data set, available with an installation of R, which gives measures of fertility and socio-economic indicators for the French-speaking provinces of Switzerland in 1888.

Correlations

When we have numeric data, we usually want to quickly know how the various variables are related to one another. Calculating correlations is a commonly used method, but it can be difficult in R to calculate and then visualise the many correlations in larger data sets without installing multiple packages and somehow joining their outputs together. In AutoExploreR we have developed three functions, targetCorrelation, multivariateCorrelation and autoCorrelationPlot, that make this process very easy. The targetCorrelation function automatically calculates all the correlations between a ‘target’ variable and all other numerical variables in a data set, whereas the multivariateCorrelation function calculates all correlations between numerical variables, automatically ignoring non-numeric data.
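For readers who want to see what these two kinds of correlation output look like without the package, the following is a minimal base-R sketch on the same 'swiss' data. It does not call AutoExploreR and is not its implementation; the choice of Fertility as the target variable and all object names (swiss_num, cor_matrix, target_cor) are assumptions made for illustration.

```r
# Base-R sketch only; none of these calls are AutoExploreR functions.
data(swiss)

# Keep numeric columns only, mirroring the "ignore non-numeric data" behaviour.
numeric_cols <- vapply(swiss, is.numeric, logical(1))
swiss_num <- swiss[, numeric_cols, drop = FALSE]

# All pairwise Pearson correlations between the numeric variables.
cor_matrix <- cor(swiss_num, use = "pairwise.complete.obs")

# Correlations between one target variable and every other numeric variable.
target <- "Fertility"  # assumed target, chosen for illustration
target_cor <- cor_matrix[target, setdiff(colnames(cor_matrix), target)]

# Strongest relationships first, ranked by absolute value.
target_cor <- target_cor[order(abs(target_cor), decreasing = TRUE)]
print(round(target_cor, 2))
```

Ranking by absolute value keeps strong negative relationships near the top, which is usually what matters when screening variables for modelling.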
With the output argument set to “matrix”, the result of the multivariateCorrelation function can be passed to autoCorrelationPlot to automatically produce a correlation plot, as shown below (an interactive version of the correlation plot is also available).

Multivariate Outliers

We often want to know if any data points are outliers; these could be points that contain erroneous data, or points that are particularly interesting because they are so different from what is 'normal' for the data set. AutoExploreR has the capability to identify outliers in univariate and multivariate data; here we look at the MultivariateOutlier function. When provided with a data set, the function will automatically find the optimal parameters for the outlier detection procedure, then return a data frame showing the location of any outliers in the set and a plot, with reduced dimensions, highlighting the outliers.

The multivariate outlier analysis of the 'swiss' data identified one outlier, Geneva, which has low fertility, very low agriculture and very high education compared to the other provinces. It is likely that in this case the data point is an outlier due to actual differences in the socio-economic development of the provinces, rather than erroneous data.

In this post, I have given a gentle introduction to a few of the capabilities of the AutoExploreR package that we at Elastacloud have found particularly useful so far. In a following post I will discuss some of the report-generating functions of the package; stay tuned for further posts, as the package is under constant development!
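As a rough illustration of the multivariate outlier idea discussed above, here is a base-R sketch that flags points by their Mahalanobis distance and projects the data onto two principal components for plotting. This is a swapped-in, textbook technique: the post does not say which procedure MultivariateOutlier uses, so the method, the 97.5% chi-squared cutoff and all object names here are assumptions, and the rows it flags may differ from the package's result.

```r
# Base-R sketch only; MultivariateOutlier's actual method is not described in the post.
data(swiss)

# Squared Mahalanobis distance of each province from the multivariate centre.
centre <- colMeans(swiss)
covariance <- cov(swiss)
d2 <- mahalanobis(swiss, center = centre, cov = covariance)

# Assumed cutoff: 97.5% quantile of a chi-squared distribution with
# one degree of freedom per variable.
cutoff <- qchisq(0.975, df = ncol(swiss))
is_outlier <- d2 > cutoff

# Data frame locating the flagged rows, largest distance first.
outliers <- data.frame(province = rownames(swiss), distance = d2,
                       outlier = is_outlier, row.names = NULL)
outliers <- outliers[order(outliers$distance, decreasing = TRUE), ]
print(head(outliers, 5))

# Reduced-dimension view: first two principal components, flagged points in red.
pca <- prcomp(swiss, scale. = TRUE)
plot(pca$x[, 1:2], col = ifelse(is_outlier, "red", "grey40"), pch = 19,
     main = "swiss: first two principal components")
```

Because the mean and covariance used here are themselves affected by extreme points, a robust estimate of location and scatter may be a better choice on messier data.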