May 15, 2018

Integrating Machine Learning into your .NET applications

There were a lot of great announcements from Build this year, from Visual Studio Live Share to Azure Sphere and everything in between. One that I was particularly excited by was the preview of ML.NET, a cross-platform, open source machine learning framework for .NET. It lets those of us writing .NET applications build, train and deploy machine learning models directly in our applications, without necessarily having to call out to external APIs or to libraries written in other languages running on the same system. The current preview release allows classification and regression models to be built and used, and also brings with it a first draft of the APIs for training models and the core components of the framework, meaning that popular libraries such as TensorFlow and CNTK can be plugged in in the future.

The process for creating a training pipeline is pretty straightforward, and I've spent some time trying it out by building a classification model for the Nottingham Trams Twitter feed, trying to determine whether a given tweet is a notification about delays, confirmation that the trams are running, marketing, or one of a few other categories. The initial version isn't that accurate, as I only had about 70 tweets to train it from, which is really nowhere near enough, so training is ongoing. But it works, and I can save out the trained model and load it back again later so I can use it without re-training (there's a sketch of that further down). The pipeline for training looks a bit like this.

var pipeline = new LearningPipeline
{
    // Load the training data from a comma-separated file with a header row
    new TextLoader<TweetData>(DataPath, true, "comma"),
    // Convert the text labels in the training data into numeric keys the trainer can use
    new Dictionarizer("Label"),
    // Turn the tweet text into a numeric feature vector
    new TextFeaturizer("Features", "TweetText"),
    // The multi-class classification trainer
    new StochasticDualCoordinateAscentClassifier {FeatureColumn = "Features", LabelColumn = "Label"},
    // Map the predicted numeric label back to the original text label
    new PredictedLabelColumnOriginalValueConverter {PredictedLabelColumn = "PredictedLabel"}
};

var model = pipeline.Train<TweetData, ClassificationPrediction>();
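
I haven't shown the TweetData and ClassificationPrediction classes used above; a rough sketch, based on the columns referenced in the pipeline and in the prediction code further down, looks something like this (the column ordinals are an assumption about how the training CSV happens to be laid out, not something ML.NET mandates).

using Microsoft.ML.Runtime.Api;   // Column and ColumnName attributes

public class TweetData
{
    [Column("0")] public string Id;
    [Column("1")] public string Language;
    [Column("2")] public string TweetText;
    // The category assigned to the tweet in the training data (Delays, Marketing, ...)
    [Column("3")] public string Label;
}

public class ClassificationPrediction
{
    // Filled in with the original text label thanks to the PredictedLabelColumnOriginalValueConverter
    [ColumnName("PredictedLabel")] public string PredictedLabel;
}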

One of the things I had to work out from the couple of examples online is the use of the Dictionarizer: because training only works with numbers and not text labels, you first need to tell the pipeline to convert the labels in your training data to numbers, and then to convert them back again afterwards using the PredictedLabelColumnOriginalValueConverter class. Once it's trained and you have a model, you can run predictions on new pieces of data; in this case I made up a couple of fake tweets and fed them back in.

var tweets = new[]
{
    // A couple of made-up tweets, based on the kind of thing the real feed posts
    new TweetData
    {
        Id = "10001",
        Language = "en",
        TweetText = "We have a disruption to our service due to a fire in a building at the Royal Centre"
    },
    new TweetData
    {
        Id = "10002",
        Language = "en",
        TweetText = "Don’t forget anyone heading to watch Forest v Derby with a match/season ticket can take advantage of our £2 Event Ticket when you travel #TheTramWay"
    },
};

foreach (var tweet in tweets)
{
    var prediction = model.Predict(tweet);
    Console.WriteLine(tweet.TweetText);
    Console.WriteLine($" -- {prediction.PredictedLabel}");
}

Because I based these on some existing tweets, I got the answers I was expecting to see; trying it with some others I completely made up didn't really give me the results I wanted, but like I said, it was only trained on about 70 tweets. The output from this bit of code is the following, with the tweet text written back out and the predicted label below it.

We have a disruption to our service due to a fire in a building at the Royal Centre
  -- Delays
Don't forget anyone heading to watch Forest v Derby with a match/season ticket can take advantage of our £2 Event Ticket when you travel #TheTramWay
  -- Marketing
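
I mentioned earlier that I can save the trained model out and load it back in later so it doesn't need re-training every time. A minimal sketch of that, assuming the WriteAsync and ReadAsync methods on PredictionModel in the current preview (ModelPath here is just a hypothetical variable pointing at wherever you want the model file to live):

// Persist the trained model to disk
await model.WriteAsync(ModelPath);

// ...and later, load it back in and use it without re-training
// (PredictionModel lives in the Microsoft.ML namespace)
var loadedModel = await PredictionModel.ReadAsync<TweetData, ClassificationPrediction>(ModelPath);
var reloadedPrediction = loadedModel.Predict(tweets[0]);
Console.WriteLine(reloadedPrediction.PredictedLabel);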

It's early days for ML.NET, but it's pretty exciting and gives developers a great way to integrate machine learning into their applications. Maybe you want to provide sentiment analysis on customer comments in that call center application, predict prices in your purchasing tools, or enrich data for your users somewhere else entirely; it's worth having a look and giving this library a try.

You can find the project over on GitHub.
