Jan 19, 2018

Using R scripts in Azure Functions




Recently we've been working on how to deploy models written in R into a production system, where they can be called as part of a data pipeline or on demand. From a data engineering point of view, one of the best tools in your toolbox when you want to execute relatively small pieces of code, without worrying about all the infrastructure or software platforms supporting it, is Azure Functions. Functions also have an extensive range of bindings supporting different input and output sources.

As it stands today, both R and Python are supported in Azure Functions, but only experimentally and with a limited number of bindings. There's a pretty good introduction over on Microsoft's Azure blog, with supporting code over on GitHub, based on using PowerShell to provide a timer-triggered function which executes an R script. Unfortunately, PowerShell is also currently experimental in Azure and has a limited number of bindings available to it, but it's a good place to start and it gets R up and running, which requires a site extension to be installed and some manual unzipping (the process isn't currently as smooth as it perhaps could be).

This got us up and running and demonstrated that we could run R code through Azure Functions; but we wanted to use other types of bindings, and to have something a little more reliable (and testable) than PowerShell. Azure Functions (v1) supports several languages, but the key ones are C# (compiled or C# Script) and JavaScript. We ran into some early issues with C# around the R.NET package and target architectures, so we moved over to JavaScript to see if we could get something running a bit quicker.

Initially we started working with Node's child processes directly, but soon came across a package called r-script which takes care of getting data into the R script and handling the response, and provides both synchronous and asynchronous methods. It's one of those neat little packages which just takes the mundane work away from you.
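To give a flavour of the package, here's roughly what the two calling styles look like. The script names follow the package's own examples, and since this needs `npm install r-script` plus a local R installation, it's shown as a sketch wrapped in a function rather than something run inline:

```javascript
// Sketch of r-script's two calling styles. Wrapped in a function
// because it needs the r-script package and R itself to actually run.
function rScriptExamples() {
  const R = require("r-script");

  // Synchronous: blocks until the R script finishes and returns its result.
  const out = R("ex-sync.R").data("hello world", 20).callSync();
  console.log(out);

  // Asynchronous: the callback receives (err, result).
  R("ex-async.R")
    .data({ x: [1, 2, 3] })
    .call(function (err, result) {
      if (err) throw err;
      console.log(result);
    });
}
```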
One thing it does make available is an R function called "needs", which handles the installation and loading of packages in your script and is the recommended way of doing this when using the library.

So, what do we need to do to make it work? First we'll need to make sure the environment has all of the tools needed to develop R code and Azure Functions; this includes ensuring that you have Node.js and npm installed, which I would recommend using a version manager for. Once this is set up and we've created the template for the function, we'll need to ensure that r-script is able to locate the R executables on the path. One way of doing this is to make sure the R_HOME variable is configured in the environment variables, either by setting it manually or, better, by setting it as an application setting for the project. Once that's done we can derive the location of the executables in code and append it to the path. Note that here I'm presuming a target OS of Windows.


You'll notice that I've also got another environment variable (application setting) defining where the R scripts are kept. It's worth highlighting at this point that using application settings will help when you deploy to Azure: the R executables will typically be at a location such as "D:\home\R-3.3.3\bin\x64", which you may not have available locally, so settings keep the code portable and maintainable. Once that has been configured, we can simply call the R code using the methods provided by r-script and capture the output. Make sure you have a read of the r-script documentation to see how to pass data in and get it back out.


We can then do any post-processing we need on the output object and send it on its way to the output location. In the case of an HTTP trigger this means returning it to the caller.
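For an HTTP trigger, that last step might look something like the following sketch; the response shape and the error handling are illustrative assumptions rather than the original project's code:

```javascript
// Shape the value returned from R into an HTTP response object.
// Assumes the R script returned something JSON-serialisable.
function toHttpResponse(result) {
  if (result === undefined || result === null) {
    return { status: 500, body: { error: "R script returned no output" } };
  }
  return {
    status: 200,
    headers: { "Content-Type": "application/json" },
    body: result
  };
}

// Inside the r-script callback of an HTTP-triggered function you would
// then set the response and signal completion:
//   context.res = toHttpResponse(result);
//   context.done();
```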


There are some important things we encountered along the way that are worth mentioning here. Choosing the right plan for your Functions is crucial: in our investigations we found that running the R scripts on a Consumption Plan was significantly slower than on an App Service Plan (20+ seconds compared to around 2 seconds in our case). Depending on how you want to test or deploy your function this may be fine, but it's something to keep in mind if you're encountering performance issues on a Consumption Plan. Get used to working in the Kudu console; it's incredibly useful for testing your R script and bits of JavaScript, as well as viewing the streaming log files. And test your functions locally; it will save you a great deal of pain and will help when debugging your functions.

New Posts
  • Something that comes up quite frequently when people start using Spark is "How can I filter my DataFrame using the contents of another DataFrame?". People with SQL experience will immediately try to replicate the following: SELECT * FROM table_a a WHERE EXISTS (SELECT * FROM table_b b WHERE b.Id = a.Id). So how do you do this in Spark? Some people will reach for the Column.isin method, which takes varargs; this is okay for a small set of values, but if you have a couple of large DataFrames then it's less than optimal, as each row needs to be evaluated against the list. So what's the alternative? We can use joins to do the same thing. There are two we can use: a SEMI JOIN, which is equivalent to running EXISTS as above, and an ANTI JOIN, which is equivalent to a NOT EXISTS. Using the above example, and keeping the table names as DataFrame names, we could re-write this in Scala as: table_a.join(table_b, Seq("Id"), "left_semi"). These two joins are unique in that they only return output from the left DataFrame, without any content from the right DataFrame. So what does this look like in practice? Using Azure Databricks we can quickly create some sample data to try them out. First let's create a couple of DataFrames, then run a simple query to find heroes which have an arch-enemy; this uses the SEMI JOIN to keep records in the left DataFrame where there is a matching record in the right DataFrame. Next, let's look for heroes who've been a little more active and have removed their arch-enemies (for now); this time we've used an ANTI JOIN to keep only those records in the left DataFrame where there are no matching records in the right DataFrame. You'll notice that in the examples the join condition uses the slightly longer form; that's because in this example the columns we're joining on have different names, and also because there is a column in both DataFrames which has the same name.
  • Recently I needed to deploy an Azure Data Lake Store - Gen 2 instance and thought I'd take the opportunity to use some custom ARM template functions. These aren't something you often see in the example templates, but they can be really useful if there's a complex expression which you find yourself writing repeatedly within a template. If, for instance, you routinely create resource names based on a prefix, unique name and a suffix, then this could save you a few keystrokes. In essence you are simply parameterizing the expression, so that you can use a simpler expression such as [namespace.function(parameter1, parameter2)] where you would previously have used the more complex version. If you want to see what this looks like in a full template then check out the simple ARM template I put together for creating a Data Lake Store - Gen 2 instance over on GitHub.
  • Documentation is not something people often spend time reading, or if they do then it's to quickly find the one thing they're after and get out as quickly as possible, very similar to how I do my Christmas shopping. Sometimes it's worth spending time reading the documentation though, as there can be some useful bits of information hidden in summary descriptions, links etc. One such item is the Azure Data Lake Store client. If you find yourself reading or writing a lot of files, and you're doing it from multiple tasks (or threads, but you should be using Tasks if possible), then reading the docs can really help you out. Take, for instance, this snippet from the description at the top of the documentation page: "If an application wants to perform multi-threaded operations using this SDK it is highly recomended to set ServicePointManager.DefaultConnectionLimit to the number of threads application wants the sdk to use before creating any instance of AdlsClient. By default ServicePointManager.DefaultConnectionLimit is set to 2."
    Okay, so how bad can things be if you don't read this? To answer that I created an ADLS instance and uploaded a number of small parquet files, then wrote an application to read each file (using the excellent Parquet .NET) and return the number of records in it; each file is processed in its own Task, and each uses the same AdlsClient instance. The simple process being followed is to get a list of files, call "ProcessPath" on each, and then output the results once all the files have been processed. The output of this initial version wasn't too bad, but with multiple tasks I would have expected better. The documentation snippet above suggests we need to change the ServicePointManager.DefaultConnectionLimit value, but to what? Some digging around turned up a suggestion from Microsoft Support which, for ASP.NET, is to limit the number of requests that can execute at the same time to 12 per CPU (or 12 per core). So let's give that a go and see what happens. The code change is pretty simple, and we can use System.Environment to get the number of processors available. So does it make much of a difference? Well, yes, quite a lot of difference actually. I ran the code in both variations a few more times to check it wasn't intermittent networking issues, other processes on my laptop interfering etc., but no, it really does make that much of a difference. So next time you're working with multiple tasks sharing resources, maybe spend a bit of time reading the documentation to see if there's anything which can make a difference to your application.