Oct 25, 2017

Project Template for Data Science in R

Edited: Oct 25, 2017

 

Every time you start a new data science project, you need to create several folders for its inputs and outputs. For each project, you have to decide where to store your datasets, scripts, images, and so on, and also how to name all these folders. That’s fine if you are working on your own, as you have room for creativity, but when you are working in a team or collaborating with other people, this creativity can become a bit of a nightmare.

 

Then you start thinking how nice it would be to have a fixed structure for all your projects, so that no matter who is working on a project, everyone knows exactly where to find everything.

 

In this article, I would like to introduce you to an R package called “ProjectTemplate” that I find quite useful for this purpose. It not only helps you organize the files in your project, but also loads all the R packages and datasets that the project needs.

 

Note: Make sure that you have R version 2.7 or higher installed and working in RStudio; otherwise the package may not work as expected. You can check your R version in the R console.
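A quick way to run this check from the console is sketched below. The variable name `r_ok` is only illustrative.

```r
# Print the full version string of the running R session
print(R.version.string)

# Compare the running version against the minimum mentioned above (2.7)
r_ok <- getRversion() >= "2.7.0"
print(r_ok)
```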

 

The first step is to install the package. You can simply follow the instructions provided at: http://projecttemplate.net/installing.html
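As a minimal sketch, assuming you install the released version from CRAN and have a CRAN mirror configured, the installation usually comes down to the following. The check with `requireNamespace()` is my addition, so the package is not reinstalled on every run.

```r
# Install ProjectTemplate from CRAN only if it is not already available
if (!requireNamespace("ProjectTemplate", quietly = TRUE)) {
  install.packages("ProjectTemplate")
}

# Attach the package for the current session
library(ProjectTemplate)
```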

 

For this example, we will use the Iris dataset. You can find information about this dataset, along with the CSV file, at: https://archive.ics.uci.edu/ml/datasets/iris

 

Creating your first project using the ProjectTemplate package:

1. Start a new session in RStudio and open an R script.

File > New File > R Script

2. In your script, set your working directory using setwd() and check that you got it right using getwd(). For example:
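A sketch of this step could look as follows. The folder used here is a placeholder inside R's temporary directory (an assumption, so that the snippet runs anywhere); replace `project_root` with the folder where you keep your projects.

```r
# Placeholder location for the projects; replace it with your own folder
project_root <- file.path(tempdir(), "r-projects")
dir.create(project_root, recursive = TRUE, showWarnings = FALSE)

# Make it the working directory and confirm the change
setwd(project_root)
print(getwd())
```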

3. Then, we can create our project in the working directory that we have chosen. There are two options when creating a project: the minimal version or the full version.

Let’s create both projects and see what they look like. For the minimal version, we will do the following:
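A hedged sketch of the call is shown below. It assumes a ProjectTemplate release in which `create.project()` accepts a `template` argument; as far as I know, older releases selected the minimal layout with `minimal = TRUE` instead, so check `?create.project` for your installed version.

```r
library(ProjectTemplate)

# Create the minimal layout in the current working directory.
# Assumption: a release that accepts `template`; older releases
# used create.project("RProjectMin", minimal = TRUE) instead.
create.project("RProjectMin", template = "minimal")
```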

If we explore the folder called RProjectMin, we will see that ProjectTemplate has automatically generated the following folders:
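You can also inspect the generated structure from R instead of the file browser. This sketch only lists whatever is in the project folder; the exact set of entries depends on the installed ProjectTemplate version, so no output is shown here.

```r
# List the top-level folders and files of the minimal project
min_contents <- list.files("RProjectMin")
print(min_contents)
```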

While if we do the following:
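A minimal sketch of the default call, with only the project name supplied:

```r
library(ProjectTemplate)

# Without further arguments, create.project() uses the full template
create.project("RProjectFull")
```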

it uses the full structure by default and automatically generates the following folders:
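To see what the full version adds on top of the minimal one, the two folders can be compared from R. This is a sketch that assumes both projects sit in the current working directory; the variable names are illustrative and the result depends on the package version.

```r
# Top-level entries of each project
full_contents <- list.files("RProjectFull")
min_contents  <- list.files("RProjectMin")
print(full_contents)

# Entries that exist only in the full structure
extra_in_full <- setdiff(full_contents, min_contents)
print(extra_in_full)
```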

Details of the meaning and purpose of the different folders can be found at: http://projecttemplate.net/architecture.html. You will also find a README.md file inside each folder that gives a general explanation of the purpose of that folder.

Finally, we copy our Iris dataset into the folder called data and load the project. This loads all the data into memory automatically, so we can start working with it.
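The copy step can also be done from R. The sketch below does not use the downloaded file: as a stand-in (my assumption, so that the snippet runs without a download), it exports R's built-in `iris` data frame to `data/iris.csv` inside the minimal project. With the real file you would use `file.copy()` from wherever you saved it.

```r
# Assumption: the minimal project from the earlier steps lives in ./RProjectMin
data_dir <- file.path("RProjectMin", "data")
dir.create(data_dir, recursive = TRUE, showWarnings = FALSE)

# Stand-in for the downloaded file: export R's built-in iris data frame as CSV
write.csv(datasets::iris, file.path(data_dir, "iris.csv"), row.names = FALSE)

# Confirm that the file is in place
print(file.exists(file.path(data_dir, "iris.csv")))
```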

If you call load.project() right away, you may get an error, because the current working directory is not yet a ProjectTemplate directory.

So, we have to set our working directory to the path of the ProjectTemplate directory, that is, the RProjectMin or RProjectFull folder that we already created. For example, if we have added the data to the data folder inside RProjectMin, we can do the following:
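A sketch of this step, assuming the current working directory is the one in which the project was created. The `dir.exists()` guard and the `template` argument are my additions so that the snippet also runs on its own; see `?create.project` if your version differs.

```r
library(ProjectTemplate)

# Only needed when running this snippet on its own: create the project if missing
if (!dir.exists("RProjectMin")) {
  create.project("RProjectMin", template = "minimal")
}

# Move into the project folder
setwd("RProjectMin")

# Read the project configuration and load everything found in data/
load.project()
```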

Then, you will see that all your data has been loaded into memory automatically:
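One way to check this is sketched below. It assumes the file was saved as `iris.csv`, in which case ProjectTemplate names the loaded object `iris` after the file. Be aware that R also ships a built-in `iris` data frame, so `ls()` is the more reliable sign that the copy in your workspace came from the project.

```r
# Objects currently defined in the global environment
print(ls())

# First rows of the data set; after load.project() this is the copy read from
# data/iris.csv, otherwise R falls back to its built-in iris data frame
print(head(iris))
```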

ProjectTemplate is designed for working with a specific dataset rather than with datasets in general. If you want to develop a solution that applies to general datasets, you may find it more useful to create your own package instead of using ProjectTemplate. You can find more information at http://projecttemplate.net/packages.html
