Many recent posts in the data science media have emphasised the importance of the modern data scientist, who not only knows about statistics, machine learning and programming with R and Python, but also knows about cloud technology, parallelising code and software development for designing better tools to efficiently automate parts of the data science life cycle. At Elastacloud, innovating new tools, productionising data science and evaluating its impact are processes we aim to incorporate in our projects. Our latest development, the “AutoExploreR” package, illustrates just this.
We follow the CRISP-DM data science life cycle. As shown below, the cycle consists of several stages: business understanding, data understanding, data preparation or cleaning, modelling, evaluation and deployment. In our experience, data understanding and data cleaning are often the most time-consuming parts of the life cycle. Typically, the functions useful for data understanding are scattered across different packages, and visualisations require considerable effort to make presentable in reports.
The AutoExploreR package was developed with the hope of partly automating the data understanding stage of the life cycle. Currently the package contains several key functions which allow the user to gain rapid insight into their data, to identify which variables may be of use for modelling, and to see what types of cleaning and feature engineering processes may be required. A brief overview of the functionality is provided in the table below.
Most importantly, the package also auto-generates reports with all the essential outputs and interactive plots. This reduces both laborious documentation effort and data understanding time: win-win! More time for the fun stuff!