Oct 12, 2017

Shiny Leaflet maps


Visualising data is an important part of the data science life cycle, and any data scientist should know how to create good visualisations of different types of data. Recently I have been working with a lot of geospatial data, which has meant putting some time into learning how to plot maps with R.

 

In my opinion, one of the best R packages for doing this is 'Leaflet'. Leaflet is actually an open-source JavaScript library for interactive maps; however, a package has been written for R users that lets you create and control Leaflet maps directly from R.

 

Leaflet maps can be integrated into Shiny applications just like any other visualisation, with renderLeaflet() in the server function and leafletOutput() in the user interface.
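As a minimal sketch of that wiring (not the code of the app described below; the York coordinates are just for centring the view):

library(shiny)
library(leaflet)

ui <- fluidPage(
  leafletOutput("map")                              # map placeholder in the UI
)

server <- function(input, output, session) {
  output$map <- renderLeaflet({
    leaflet() %>%
      addTiles() %>%                                # default OpenStreetMap tiles
      setView(lng = -1.08, lat = 53.96, zoom = 12)  # centred roughly on York
  })
}

shinyApp(ui, server)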

 

To show just a couple of the things that I have liked about using Leaflet in Shiny, I have created a small app (my code is available here if you want to have a play around yourself) using the City of York Council Road Safety data (link here). This data gives information on over 5,000 personal injury accidents on public roads in York that were reported to the police, including longitude and latitude coordinates.

 

The Shiny app shows the location of accidents on the map and a histogram of accident counts per day of the week, and it showcases two ways in which Leaflet and Shiny can interact. The first is a Shiny input controlling what is shown on the map: in the server code below, the value of input$severity determines which accidents are passed to the leaflet function through the reactive accidentSeverity.
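A sketch of what that server code might look like ('accidents' and its 'severity', 'longitude' and 'latitude' columns are assumed names, not necessarily those of the app):

# Reactive that keeps only the accidents matching the severity chosen in the UI
accidentSeverity <- reactive({
  accidents[accidents$severity %in% input$severity, ]
})

output$map <- renderLeaflet({
  leaflet(accidentSeverity()) %>%
    addTiles() %>%
    addCircleMarkers(lng = ~longitude, lat = ~latitude, radius = 4)
})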

 

 

The second interaction shows how the map bounds can be used to control the output of other parts of the Shiny app. In this case, the histogram automatically re-plots to show only the accidents within the current map bounds. The relevant server code for this is shown below.
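Again a sketch rather than the app's exact code (the 'accidents' data frame and the 'day_of_week' column are assumed names):

# Reactive returning only the accidents inside the current map view;
# input$map_bounds is supplied automatically by the leaflet map named "map"
accidentsInView <- reactive({
  bounds <- input$map_bounds
  if (is.null(bounds)) return(accidents[FALSE, ])
  subset(accidents,
         latitude >= bounds$south & latitude <= bounds$north &
         longitude >= bounds$west & longitude <= bounds$east)
})

output$histogram <- renderPlot({
  barplot(table(accidentsInView()$day_of_week))   # counts per day of the week
})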

 

 

Note how the accidentsInView reactive uses an input called input$map_bounds, which is used to filter the dataframe to the accidents currently being shown on the map. input$map_bounds is a special case of the Leaflet map events, which are usually sent to Shiny following the general pattern input$MAPID_OBJCATEGORY_EVENTNAME, where MAPID is the map's name (just "map" in this example), OBJCATEGORY is a category of object on the map such as a shape or marker, and EVENTNAME is an event such as a mouse click or mouse hover.
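For example, a click on any marker of the map named "map" arrives as input$map_marker_click and could be handled like this (the notification text is just illustrative):

observeEvent(input$map_marker_click, {
  click <- input$map_marker_click   # a list containing lat, lng and the marker id (if one was set)
  showNotification(paste0("Marker clicked at ", round(click$lat, 4), ", ", round(click$lng, 4)))
})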

 

I'm hoping to keep exploring Leaflet maps and their application in Shiny, so keep an eye out for another article. In the meantime, check out the Shiny Gallery, particularly the SuperZip example, which was a great help to me in learning how to use the Leaflet input events.

New Posts
  • Microsoft have recently released an updated version of their Azure Machine Learning service. At Elastacloud we have been using AML since the first release to deploy machine learning models to the cloud. AML provides a platform to develop, train, test, deploy, manage and track machine learning models, but it is mostly the deploy and manage parts that Elastacloud have made use of, so this article is going to focus on those aspects. Based on feedback from the community, Microsoft have made sweeping changes to the service which essentially mean it is a new product. Some of the major changes that users will have to adjust to are:

- No workbench
- No need for individual experimentation and model management accounts, just a single workspace
- New Python SDKs
- New Azure Machine Learning CLI extension

My view on the removal of the workbench is neutral; I never used it previously other than to launch the CLI. My understanding is that it was an underused tool, with most people preferring to develop their models in an IDE such as VS Code or even in Jupyter Notebooks.

There are two new Python SDKs: the Machine Learning SDK and the Data Prep SDK. The ML SDK, in Microsoft's words, "is used by data scientists and AI developers to build and run machine learning workflows upon the Azure Machine Learning service". In a recent Elastacloud project we deployed a number of predictive machine learning models, as a web service, for a customer using the new AML service. I used the Machine Learning SDK to complete this task and, after just a few teething problems, I found that the SDK was easy to work with and certainly felt like a tool that made me more productive.

One of the requirements for this web service needed some consideration of how best to structure the service: different models should be loaded and used to generate the predictions based on the day of the week. This requirement arises because our customer always needs forecasts for the next two days and the next two business days. On a Friday, for example, they want forecasts for the next two days (Saturday and Sunday) and the next two business days (Monday and Tuesday), whereas on a Monday they only need them for the next two business days (Tuesday and Wednesday). Therefore, we have three different models (one day ahead, two days ahead, weekend) deployed to the same service, with multiple dependencies (e.g. *.py files). Azure ML made the creation of the service very easy, as demonstrated in the sketch at the end of this post.

Creating a Docker image

Models are deployed in Docker images to Azure Kubernetes Service (AKS) or Azure Container Instances (ACI). Creating this image takes only two lines: image_config contains the required score.py file alongside the optional dependencies and a conda environment file, and the image is then built with ContainerImage.create, where the already registered models are provided.

Deploying the service

Once we have a successfully built Docker image we can deploy it to AKS or ACI as a web service. This, again, is very easy to do with the SDK: we only need to define a configuration with AksWebservice.deploy_configuration() (which gives a default configuration), then use the deploy_from_image method, providing our Docker image as one of the arguments. And so we have successfully deployed our machine learning models as a web service!
Now we can get the scoring URI (aks_service.scoring_uri) and the access keys (aks_service.get_keys()) and start making requests to our machine learning models.
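A rough sketch of those steps using the v1 azureml-sdk Python API (the names ws, models, aks_target, score.py, helpers.py and myenv.yml are placeholders, not the project's actual code):

from azureml.core.image import ContainerImage
from azureml.core.webservice import AksWebservice, Webservice

# Describe the image: scoring script, extra dependencies and a conda environment file
image_config = ContainerImage.image_configuration(execution_script="score.py",
                                                  runtime="python",
                                                  conda_file="myenv.yml",
                                                  dependencies=["helpers.py"])

# Build the Docker image from the already registered models
# ('ws' is the Workspace, 'models' a list of registered Model objects)
image = ContainerImage.create(workspace=ws, name="forecast-image",
                              models=models, image_config=image_config)
image.wait_for_creation(show_output=True)

# Deploy the image to an existing AKS cluster as a web service
aks_config = AksWebservice.deploy_configuration()          # default configuration
aks_service = Webservice.deploy_from_image(workspace=ws, name="forecast-service",
                                            image=image,
                                            deployment_config=aks_config,
                                            deployment_target=aks_target)   # aks_target: an AksCompute
aks_service.wait_for_deployment(show_output=True)

print(aks_service.scoring_uri)   # endpoint to POST requests to
print(aks_service.get_keys())    # authentication keys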
  • Extreme Gradient Boosting (xgboost) is a very fast, scalable implementation of gradient boosting that has taken the data science world by storm, with xgboost regularly winning online data science competitions and being used at scale across different industries. Xgboost was originally developed by Tianqi Chen and is renowned for execution speed and model performance. I have recently been conducting some experiments with xgboost for the Renewables AI products. Below I show how to run a simple regression-type tree-based and linear-based model; thereafter we go on to explore grid search and random search with xgboost. But first a little background information…

Boosting is what gives xgboost its state-of-the-art performance. Boosting is not a specific machine learning algorithm, but a concept that can be applied to a set of machine learning algorithms, hence boosting is known as a meta-algorithm. Essentially, xgboost is an ensemble method, used to convert many weak learners (models performing slightly better than chance) into a strong learner. This is achieved via boosting, where a set of weak learners is iteratively learnt on subsets of the data. Each weak learner is weighted according to performance; thereafter, each weak learner's predictions are combined and multiplied by their weight to obtain a final weighted prediction, which is better than any of the individual predictions themselves.

The Python API is capable of running xgboost on regression and classification problems, using decision tree and linear learners. Below we apply xgboost to a regression-type problem using a tree-based learner. Decision trees are an iterative construction of binary decisions (one decision at a time) until a stopping criterion is met (i.e. the majority of one decision split consists of one category/value or another). Individual trees tend to overfit (low bias, high variance), hence they perform well on training data but don't generalise as well, which is why ensemble methods are useful in this scenario. Notice that, as this is a regression-type problem, we use the loss function "reg:linear", whereas for a classification problem we would use "reg:logistic" or "binary:logistic" depending on whether you are interested in the class or the probability of the class. A loss function maps the difference between the actual and predicted values; we aim to find the model with the lowest loss.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt            # needed for plt.show() below
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
import xgboost as xgb
from xgboost import plot_tree

X_train = pd.read_csv("X_train.csv")
Y_train = pd.read_csv("Y_train.csv")
X_test = pd.read_csv("X_test.csv")
Y_test = pd.read_csv("Y_test.csv")

list(X_train)   # inspect column names
list(X_test)

####################################
# XGBoost Decision Tree
####################################
xg_reg = xgb.XGBRegressor(objective='reg:linear', n_estimators=10, seed=123)
xg_reg.fit(X_train, Y_train)
# XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
#              colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
#              max_depth=3, min_child_weight=1, missing=None, n_estimators=10,
#              n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
#              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=123,
#              silent=True, subsample=1)

preds = xg_reg.predict(X_test)
rmse = np.sqrt(mean_squared_error(Y_test, preds))
print("RMSE: %f" % (rmse))   # RMSE: 164.866642
r2 = r2_score(Y_test, preds)

# Plot the first tree
xgb.plot_tree(xg_reg, num_trees=0)
plt.show()

In the following code we apply xgboost to a regression-type problem using linear-based learners.

####################################
# XGBoost Linear Regression
####################################
DM_train = xgb.DMatrix(data=X_train, label=Y_train)
DM_test = xgb.DMatrix(data=X_test, label=Y_test)

params = {"booster": "gblinear", "objective": "reg:linear"}
xg_reg = xgb.train(params=params, dtrain=DM_train, num_boost_round=10)

preds = xg_reg.predict(DM_test)
rmse = np.sqrt(mean_squared_error(Y_test, preds))
print("RMSE: %f" % (rmse))   # RMSE: 169.731848
r2 = r2_score(Y_test, preds)

Like many other algorithms, performance can be enhanced by tuning the hyperparameters. Below is an example of an xgboost grid search. Grid search can be quite computationally expensive, as we exhaustively search over a given set of hyperparameters and pick the best-performing combination. For example, if we have 2 hyperparameters to tune and 4 possible values for each parameter, that's 16 possible parameter configurations. An alternative to grid search is random search, where you define how many models/iterations to try before stopping; during each iteration, the algorithm randomly selects a value within the range specified for each hyperparameter (a sketch of random search follows the grid search example at the end of this post).
import pandas as pd
import xgboost as xgb
import numpy as np
from sklearn.model_selection import GridSearchCV

housing_data = pd.read_csv("ames_housing_trimmed_processed.csv")
X, y = housing_data[housing_data.columns.tolist()[:-1]], housing_data[housing_data.columns.tolist()[-1]]
housing_dmatrix = xgb.DMatrix(data=X, label=y)   # not needed by the sklearn API used below

gbm_param_grid = {'learning_rate': [0.01, 0.1, 0.5, 0.9],
                  'n_estimators': [200],
                  'subsample': [0.3, 0.5, 0.9]}

gbm = xgb.XGBRegressor()
grid_mse = GridSearchCV(estimator=gbm, param_grid=gbm_param_grid,
                        scoring='neg_mean_squared_error', cv=4, verbose=1)
grid_mse.fit(X, y)

print("Best parameters found: ", grid_mse.best_params_)
print("Lowest RMSE: ", np.sqrt(np.abs(grid_mse.best_score_)))

A quick overview of the hyperparameters that can be tuned for tree-based models:

- eta / learning rate: how quickly the model fits residual error using additional base learners
- gamma: minimum loss reduction required to create a new tree split
- lambda: L2 regularisation on leaf weights
- alpha: L1 regularisation on leaf weights
- max_depth: how big a tree can grow
- subsample: percentage of the sample that can be used for any given boosting round
- colsample_bytree: the fraction of features that can be called on during any boosting round (ranges from 0 to 1)

An overview of the hyperparameters that can be tuned for linear learners:

- lambda: L2 regularisation on weights
- alpha: L1 regularisation on weights
- lambda_bias: L2 regularisation term on bias

Another useful blog post related to xgboost can be found here. Happy experimenting!
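As referenced above, a sketch of the random search alternative using scikit-learn's RandomizedSearchCV (the parameter ranges and n_iter value here are illustrative assumptions, not results from the original experiments):

from sklearn.model_selection import RandomizedSearchCV
import numpy as np
import xgboost as xgb

# Ranges to sample from; n_iter controls how many random configurations are tried
gbm_param_dist = {'learning_rate': np.linspace(0.01, 0.9, 20),
                  'n_estimators': [200],
                  'subsample': np.linspace(0.3, 0.9, 7)}

random_mse = RandomizedSearchCV(estimator=xgb.XGBRegressor(),
                                param_distributions=gbm_param_dist,
                                n_iter=10, scoring='neg_mean_squared_error',
                                cv=4, verbose=1, random_state=123)
random_mse.fit(X, y)   # X, y as prepared in the grid search example above

print("Best parameters found: ", random_mse.best_params_)
print("Lowest RMSE: ", np.sqrt(np.abs(random_mse.best_score_)))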
  • I have had the opportunity to speak at various data science MeetUp events in Nottingham, Loughborough and London. Typically, rather than use the usual PowerPoint presentation, I prefer running live code, as this gives the audience confidence that my code works and is repeatable. The challenge with this for me, however, is that I tend to do most of my preparation at the office and end up using my personal laptop for the eventual presentation. This means I run an 'office computer' to GitHub repo to 'personal computer' back-and-forth triangle. That was until a colleague in the office introduced me to Azure Notebooks (thanks Darren!).

Microsoft Azure Notebooks is a free service that provides Jupyter Notebooks along with supporting packages for R, Python and F#. Using these notebooks is easy! All you need is a free account at www.notebooks.azure.com. Azure Notebooks uses libraries for grouping notebooks; for example, I now have a library for my MeetUp events based on location. Once you are signed in and have created a new library by clicking the '+New Library' button, you can use the '+New' button in the library environment to create a notebook by clicking the 'Item type' drop box. At the moment it supports Python (2.7, 3.5, 3.6), R and F#. If you are fussy about organisation, it also allows you to create a folder instead and then create your files inside it. In addition to creating new files, you can also load files from a URL or from your computer.

That's not all! The notebook file on Azure has a cool slide presentation feature called the RISE Slideshow. RISE is a notebook extension which allows you to use a notebook for presentations. To enable slide mode:

- In your notebook, click View / Cell Toolbar / Slide Show
- For each cell, select its type and hierarchy on the right-hand side
- To start the presentation, click the "graph" icon on the main toolbar, and use left/right/up/down to navigate slides

This automatically turns your notebook into a slide presentation! The coolest thing about Azure Notebooks is that it all runs in the cloud. This means that you can present using any computer or laptop (even if it has no Python installed), and you do not need to pip install any library on the local machine; it's all done in the cloud. Note that your packages will only be available for the lifetime of your notebook server, and your notebook server will typically shut down after 1 hour of inactivity. Azure Notebooks also lets you auto-setup your environment if you have a pip requirements file.

There's so much to it and I'll just let you explore! I hope this is as useful for you as it was (and still is) for me. Let me know if you've used this before and what you like or do not like about it. For suggestions on other platforms, especially for beginners to use for Python/R, see this post by Laura Da Silva.