To us data scientists, parallel computing may seem alien. We're usually not from computing backgrounds, so the idea of optimising our code for performance can seem quite daunting. However, if you are running into performance problems whilst using R, perhaps when trying to train multiple machine learning models at once, then learning how to parallelize your code could be hugely beneficial.
In this post I aim to show you that parallelization can be achieved with relatively little difficulty and that significant performance boosts can be attained. In R, as usual, there is more than one way to skin a cat, with numerous packages available. I am going to introduce the foreach package, showing how it can be used with time-series cross-validation to achieve significant performance improvements over a standard implementation with lapply.
Time-series cross-validation aims to use the whole data set for both training and testing, so if we have a particularly long time series or want to test multiple models it can be very time-consuming. The code below shows how multiple ARIMA(p,d,q) models can be tested in parallel. The time series I used was two years' worth of daily data, meaning that for each ARIMA model there were almost 700 training and test splits.
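As a rough illustration of the approach described above, here is a minimal sketch rather than the exact code behind the results in this post. It makes several assumptions: doParallel is used as the backend (foreach needs one, and other backends would work too), the series is simulated and deliberately short so the example runs in seconds, and `cv_arima` and `min_train` are hypothetical names introduced for the example. The cross-validation fits on an expanding window and forecasts one step ahead at each split.

```r
library(foreach)
library(doParallel)  # assumed backend; any foreach backend should work

# Stand-in series (simulated, and much shorter than two years of daily data)
set.seed(42)
y <- as.numeric(arima.sim(model = list(ar = 0.6), n = 60))

# Hypothetical helper: rolling-origin cross-validation for one ARIMA order.
# For each origin t, fit on y[1:t], forecast y[t + 1], and return the RMSE.
cv_arima <- function(y, order, min_train = 40) {
  origins <- seq(min_train, length(y) - 1)
  errors <- vapply(origins, function(t) {
    fit <- tryCatch(arima(y[1:t], order = order, method = "ML"),
                    error = function(e) NULL)
    if (is.null(fit)) return(NA_real_)  # skip splits where the fit fails
    y[t + 1] - as.numeric(predict(fit, n.ahead = 1)$pred)
  }, numeric(1))
  sqrt(mean(errors^2, na.rm = TRUE))
}

# 27 candidate models: p, d and q each taken from 0, 1, 2
orders <- expand.grid(p = 0:2, d = 0:2, q = 0:2)

# Sequential baseline with lapply
rmse_seq <- lapply(seq_len(nrow(orders)), function(i) {
  cv_arima(y, unlist(orders[i, ], use.names = FALSE))
})

# Parallel version: each model is cross-validated on its own worker
n_workers <- max(1, parallel::detectCores() - 1, na.rm = TRUE)
cl <- parallel::makeCluster(n_workers)
registerDoParallel(cl)
rmse_par <- foreach(i = seq_len(nrow(orders)), .combine = c) %dopar% {
  cv_arima(y, unlist(orders[i, ], use.names = FALSE))
}
parallel::stopCluster(cl)

# Attach the scores to the grid and list the best-scoring orders first
orders$rmse <- rmse_par
head(orders[order(orders$rmse), ])
```

The only structural change from the lapply version is the `foreach(...) %dopar% { ... }` call plus the cluster setup and teardown around it. The body of the loop is the same function call, which is why this kind of problem is such an easy win.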
The results show a considerable performance boost for the foreach solution, with the lapply solution taking 3.4 times as long.
This is the time taken to carry out time-series cross-validation on 27 different ARIMA(p,d,q) models.
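If you want to make this kind of comparison on your own machine, `system.time` is enough. The sketch below is a toy example and not the benchmark reported above: `slow_task` is a hypothetical stand-in for an expensive model fit, and two workers with a doParallel backend are assumed. The ratio you see will depend on your core count and on the overhead of starting the workers.

```r
library(foreach)
library(doParallel)  # assumed backend

# Hypothetical stand-in for an expensive, independent piece of work
slow_task <- function(i) {
  Sys.sleep(0.2)
  i^2
}

# Time the sequential version
t_seq <- system.time(
  res_seq <- lapply(1:8, slow_task)
)["elapsed"]

# Time the parallel version on two workers (cluster start-up excluded)
cl <- parallel::makeCluster(2)
registerDoParallel(cl)
t_par <- system.time(
  res_par <- foreach(i = 1:8) %dopar% slow_task(i)
)["elapsed"]
parallel::stopCluster(cl)

cat(sprintf("lapply: %.2fs, foreach: %.2fs, ratio: %.1fx\n",
            t_seq, t_par, t_seq / t_par))
```

For very cheap tasks the parallel version can come out slower, because shipping data to the workers costs more than the work itself.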
I hope that this little example has shown you that it is well worth the effort to learn about parallel computing. The wins can be had quite easily, especially for problems like the one above, which are known as 'embarrassingly parallel' because each model can be evaluated independently of the others.
Further reading: -
A more in-depth look at parallelization and why it's not always the right thing to do - here