Tales from the Spark Summit: Exploring Apache NiFi

At the Spark Summit I managed to meet up with Andrew Morgan, whom I first met a few years ago when he was a regular at Data Science London and working on a time series predictive model called Trend Calculus, which looked pretty fantastic.

He has since moved on to a new project that predicts which news stories people will be interested in reading. The machine learning components are new and very interesting, and he built them with Apache Spark, hence his talk at the summit last week. I will probably write another post this month on the higher-level content of the talk. In the meantime, I got a couple of copies of his new book, Mastering Spark for Data Science, written with three other authors, and I have been working through the examples.

A few years ago I looked at Apache NiFi and got a basic dataflow going. Chapter 2 of the book is dedicated to building a NiFi pipeline that pulls data from a feed the authors have created.

Each pipeline lives in a container called a "Process Group". This particular pipeline centers on an HTTP request that returns a file manifest, followed by a step that unzips the files. Every activity can branch on success or failure, so the resulting network can become fairly complex. One interesting feature is the ability to trace the provenance of each input and output file through the pipeline. Properties and attributes travel through the pipeline with the data as a "flow file", so you can inspect the details at each stage.
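To make the flow file idea concrete, here is a rough sketch in plain Python. It is not NiFi code and uses no NiFi API: the FlowFile class, the two steps, and the attribute names (manifest.count, archive.type) are all made up for illustration. It only shows the shape of the idea: content plus attributes moving through steps that each route to success or failure, with a snapshot kept at every stage as a crude provenance trail.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class FlowFile:
    """Toy stand-in for a flow file: raw content plus string attributes."""
    content: bytes
    attributes: Dict[str, str] = field(default_factory=dict)


# A step inspects or updates the flow file and names the relationship
# it routes to: "success" or "failure".
Step = Callable[[FlowFile], str]
ProvenanceEvent = Tuple[str, str, Dict[str, str]]


def read_manifest(ff: FlowFile) -> str:
    # Hypothetical step: treat the content as a manifest, one file name per line.
    lines = [ln for ln in ff.content.decode("utf-8").splitlines() if ln.strip()]
    if not lines:
        return "failure"
    ff.attributes["manifest.count"] = str(len(lines))
    return "success"


def require_zip_entries(ff: FlowFile) -> str:
    # Hypothetical step: only continue if every manifest entry is a .zip file.
    names = ff.content.decode("utf-8").split()
    if not all(name.endswith(".zip") for name in names):
        return "failure"
    ff.attributes["archive.type"] = "zip"
    return "success"


def run(ff: FlowFile, steps: List[Tuple[str, Step]]) -> List[ProvenanceEvent]:
    """Run the steps in order, recording (step, outcome, attributes) each time."""
    provenance: List[ProvenanceEvent] = []
    for name, step in steps:
        outcome = step(ff)
        # Copy the attributes so later steps do not rewrite earlier snapshots.
        provenance.append((name, outcome, dict(ff.attributes)))
        if outcome == "failure":
            break  # a real flow would route to a failure branch here
    return provenance


if __name__ == "__main__":
    steps = [("read_manifest", read_manifest),
             ("require_zip_entries", require_zip_entries)]
    for event in run(FlowFile(b"a.zip\nb.zip\n"), steps):
        print(event)
```

In NiFi itself the routing is drawn on the canvas rather than written as a loop, but the snapshot-per-stage list is roughly what the provenance view lets you browse.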

I have multiple concurrent flows: in one direction the pipeline finds and replaces text and builds up a metadata repository in Elasticsearch, whilst in the other it unzips the raw files and puts them into Azure Blob Storage and onto a data disk attached to the NiFi VM.

Lastly, I was playing around with wiring up Azure Data Lake Store (ADLS). Not quite there, but nearly. ADLS is a scalable, WebHDFS-compatible store, so if you deploy Hadoop on the VM and copy over a core-site.xml with the configuration from HDInsight, it should work. Nearly does for me!
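One quick sanity check after copying the file is to confirm that it actually carries some Data Lake settings. The sketch below is an assumption-heavy helper, not part of NiFi or Hadoop: it parses a core-site.xml document and lists every property whose name mentions "adl". The sample property name (fs.adl.oauth2.client.id) follows the naming used in Hadoop's Azure Data Lake module as far as I know, and the value is a placeholder. The exact keys vary by Hadoop version, so go by whatever the HDInsight file really contains.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# Illustrative sample only; the names and values in a real HDInsight
# core-site.xml may differ.
SAMPLE_CORE_SITE = """
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>fs.adl.oauth2.client.id</name>
    <value>PLACEHOLDER</value>
  </property>
</configuration>
"""


def adl_properties(core_site_xml: str) -> Dict[str, str]:
    """Return the properties whose name mentions 'adl' (a rough substring filter)."""
    root = ET.fromstring(core_site_xml)
    found: Dict[str, str] = {}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip()
        if "adl" in name.lower():
            found[name] = value
    return found


if __name__ == "__main__":
    for name in sorted(adl_properties(SAMPLE_CORE_SITE)):
        # Print names only, so credentials never end up in a terminal log.
        print(name)
```

If this comes back empty for the copied file, the Data Lake configuration never made it across, which is a cheaper thing to discover here than from a failing processor.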

Happy trails.