Microsoft has now released v2 of Azure Data Factory. Although it is still in preview, it has the handy ‘Author and Deploy’ tool, which includes the Copy Activity wizard to help you create a copy data pipeline. Most of this is the same as in v1, but this second iteration introduces some changes. I have had the good fortune to work with these changes, and they are the subject of this blog. I will highlight the differences that Azure Data Factory v2 has brought in as of the time of writing.
1. Partitioning via a pipeline parameter – In v1, you could use the partitioning property and the SliceStart variable to achieve partitioning. In v2, however, this behaviour is achieved through pipeline parameters. To set it up (this applies both when using the Copy Wizard and when using an ARM Template for the pipeline):
a. Define a pipeline parameter of type string.
b. Set folderPath in the dataset definition to the value of the pipeline parameter.
c. Pass a hardcoded value for the parameter before running the pipeline, or pass a trigger start time or scheduled time dynamically at runtime.
d. The below Git Gist outlines an example snippet of JSON code (of the above) from an Azure Resource Manager Template:
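As a rough sketch of steps a to c, the Python below builds the relevant JSON fragments as dictionaries and prints the dataset. The parameter name scheduledRunTime, the input container, and the dataset, pipeline and linked service names are all assumptions made for illustration; the Copy activity and the surrounding ARM resource wrapper are left out.

```python
import json

# Hypothetical names, used only for this sketch.
PARAM_NAME = "scheduledRunTime"
CONTAINER = "input"

# Step a: the pipeline declares a parameter of type String.
pipeline = {
    "name": "CopyPartitionedData",
    "properties": {
        "parameters": {PARAM_NAME: {"type": "String"}},
        "activities": [],  # the Copy activity itself is omitted here
    },
}

# Step b: folderPath in the dataset is an expression built from that parameter.
folder_expression = (
    CONTAINER
    + "/@{formatDateTime(pipeline().parameters."
    + PARAM_NAME
    + ", 'yyyy/MM/dd')}/"
)
dataset = {
    "name": "PartitionedBlobDataset",
    "properties": {
        "type": "AzureBlob",
        "linkedServiceName": {
            "referenceName": "StorageLinkedService",
            "type": "LinkedServiceReference",
        },
        "typeProperties": {
            "folderPath": {"value": folder_expression, "type": "Expression"}
        },
    },
}

# Step c: a hardcoded argument for a manual run, or the trigger's
# scheduled time passed in dynamically at runtime.
manual_run_arguments = {PARAM_NAME: "2018-01-15T00:00:00Z"}
trigger_run_arguments = {PARAM_NAME: "@trigger().scheduledTime"}

print(json.dumps(dataset, indent=2))
```

With these values, a run for 15 January 2018 would be expected to resolve the folder to input/2018/01/15/.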
2. Custom Activity – In v1, to define a custom activity you had to implement the (custom) .NET Activity by creating a .NET Class Library project with a class that implements the Execute method of the IDotNetActivity interface. In Azure Data Factory v2, a Custom Activity does not require you to implement a .NET interface. You can now directly run commands, scripts and your own custom code, compiled as an executable. To configure this, you specify the command property together with the folderPath property. The Custom Activity will upload the executable and its dependencies to folderPath and execute the command for you. Linked Services, Datasets and Extended Properties defined in the JSON payload of a Data Factory v2 Custom Activity can be accessed by your executable as JSON files. Required properties can be accessed using a JSON serialiser. To create an executable for a Custom Activity you need to:
a. Create a new project in Visual Studio.
b. Choose Windows Desktop Application -> Console Application (.NET Framework). Be sure you target the .NET Framework and not .NET Core, otherwise a .exe will NOT be created at build time.
c. Add code files as needed, including JSON files, e.g. Linked Services.
d. Once done, build the project and then open the project folder \bin\<Debug or Release>\<MyProject>.exe
e. Upload the .exe file to Blob Storage in Azure (make sure the executable is provided in the Azure Storage Linked Service Template). When uploading a custom activity executable to blob storage, be sure to upload all the contents of the bin\Debug (or Release) folder, copying the entire folder to blob; otherwise the custom activity will fail because it cannot find the dependencies the application needs to run. Also, use subfolders when uploading custom activities. This keeps the layout future-proof in case further activities are added. For this I used Azure Storage Explorer, which you can use to access the storage account and create the container and subsequent folders.
f. Create the pipeline in Data Factory v2 using Batch Service -> Custom.
g. Create a Batch account and pool (if not already created) and set up the pipeline as normal.
h. Trigger the run and test the pipeline.
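For orientation, a Custom Activity definition produced by the steps above might look roughly like the dictionary below, serialised to JSON. The activity, linked service and folder names and the extended property are assumptions for this sketch; the points to notice are the command and folderPath properties described earlier.

```python
import json

custom_activity = {
    "name": "RunMyProject",  # hypothetical activity name
    "type": "Custom",
    # Azure Batch linked service that supplies the compute pool.
    "linkedServiceName": {
        "referenceName": "AzureBatchLinkedService",  # hypothetical
        "type": "LinkedServiceReference",
    },
    "typeProperties": {
        # Command run on the Batch node.
        "command": "MyProject.exe",
        # Storage account that holds the executable and its dependencies.
        "resourceLinkedService": {
            "referenceName": "StorageLinkedService",  # hypothetical
            "type": "LinkedServiceReference",
        },
        # Container/subfolder holding the contents of bin\Debug or bin\Release.
        "folderPath": "customactivities/myproject",
        # Free-form values the executable can read at runtime.
        "extendedProperties": {"environment": "test"},
    },
}

pipeline = {
    "name": "CustomActivityPipeline",
    "properties": {"activities": [custom_activity]},
}

print(json.dumps(pipeline, indent=2))
```

Keeping each executable in its own subfolder, as in the folderPath value here, matches the advice in step e.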
Custom Activities are run in Azure Batch, so make sure the Batch Service meets the needs of the application. Whilst we are on the topic of Azure Batch, I would like to add a note on how to monitor it. To monitor custom activity runs in an Azure Batch pool, or an Azure Batch run in general, use the tool Batch Labs. Once the activity has run, you can open the stderr.txt or stdout.txt file to see the run details.
The features of ADF v2 that have helped me most compared with v1 are the new control flow activities, of which I used Chaining Activities the most. Their biggest advantage is that you can define transformation activities without using datasets in V2. Control Flow activities allow for more flexible data pipeline models that are no longer tied down to time-series data, supporting diverse integration flows and patterns in the modern data warehouse. ADF v2 now allows for the following flows (which were not previously possible in v1):
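Since a Custom Activity can run a script as well as an .exe, the idea of reading the JSON payload and writing to stdout can be sketched in Python. This assumes the payload arrives as an activity.json file in the working folder with extended properties under typeProperties.extendedProperties; the sliceDate property and the sample payload are made up for the example.

```python
import json
from pathlib import Path


def load_payload(folder, file_name="activity.json"):
    """Parse one of the JSON files placed next to the executable."""
    with open(Path(folder) / file_name, encoding="utf-8") as handle:
        return json.load(handle)


def extended_property(activity, name, default=None):
    """Return typeProperties.extendedProperties[name] from a parsed activity."""
    properties = activity.get("typeProperties", {}).get("extendedProperties", {})
    return properties.get(name, default)


# In-memory stand-in for activity.json so the sketch runs anywhere.
sample_activity = json.loads(
    """
    {
      "name": "RunMyProject",
      "type": "Custom",
      "typeProperties": {
        "command": "python main.py",
        "extendedProperties": {"sliceDate": "2018-01-15"}
      }
    }
    """
)

# Anything printed here would end up in stdout.txt for the Batch task.
print("sliceDate =", extended_property(sample_activity, "sliceDate"))
```

Inside a real run you would call load_payload(".") instead of using the sample, and the same approach should apply to the linked service and dataset files.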
Chaining Activities - In V1, you had to configure the output of an activity as an input of another activity to chain them. In V2, however, you can chain activities in a sequence within a pipeline using the dependsOn property in an activity definition to chain it with an upstream activity. Super useful!
Branching Activities - In V2, you can branch activities within a pipeline. The If-condition activity provides the same functionality that an if statement provides in programming languages. It evaluates a set of activities when the condition evaluates to true and another set of activities when the condition evaluates to false.
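The shape of an If-condition activity can be sketched as follows. The rowCount pipeline parameter and the activity names are assumptions, and Wait activities stand in for the real work in each branch.

```python
import json


def placeholder(name):
    """A Wait activity used as a stand-in for real work in each branch."""
    return {"name": name, "type": "Wait", "typeProperties": {"waitTimeInSeconds": 1}}


if_activity = {
    "name": "CheckRowCount",
    "type": "IfCondition",
    "typeProperties": {
        # rowCount is a hypothetical pipeline parameter.
        "expression": {
            "value": "@greater(pipeline().parameters.rowCount, 0)",
            "type": "Expression",
        },
        "ifTrueActivities": [placeholder("ProcessRows")],
        "ifFalseActivities": [placeholder("SkipProcessing")],
    },
}


def branch_names(activity, condition_result):
    """Names of the activities that would run for a given condition result."""
    key = "ifTrueActivities" if condition_result else "ifFalseActivities"
    return [child["name"] for child in activity["typeProperties"][key]]


print(branch_names(if_activity, True))
print(branch_names(if_activity, False))
print(json.dumps(if_activity, indent=2))
```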
Parameters - You can define parameters at the pipeline level and pass arguments while you're invoking the pipeline on-demand or from a trigger. Activities can consume the arguments that are passed to the pipeline.
Custom State Passing – One activity can consume the activity output of another activity (including the state) in the pipeline. Use the following syntax inside a JSON Definition of an activity to access the output of previous activities: @activity('NameofPreviousActivity').output.value. This feature allows you to build workflows in which values can pass through activities.
Looping Containers – This is basically a ForEach Activity, which iterates over a collection and runs specified activities in a loop. The behaviour of this control flow is like that of the foreach loop in programming languages. The Until activity provides the same functionality that a do-until looping structure provides in programming languages. It runs a set of activities in a loop until the condition that's associated with the activity evaluates to true. You can specify a timeout value for the Until activity in Data Factory.
Trigger-based Flows – Pipelines can now be triggered on demand or on a wall-clock schedule.
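To tie state passing and looping together, the sketch below has a Lookup feed a ForEach through @activity('LookupTables').output.value, followed by an Until loop with a timeout. The table, dataset and activity names, the status URL and the shape of its response are all assumptions.

```python
import json

lookup = {
    "name": "LookupTables",
    "type": "Lookup",
    "typeProperties": {
        "source": {
            "type": "SqlSource",
            "sqlReaderQuery": "select TableName from config.TablesToCopy",
        },
        "dataset": {"referenceName": "ConfigDataset", "type": "DatasetReference"},
        "firstRowOnly": False,  # return every row, exposed as output.value
    },
}

for_each = {
    "name": "ForEachTable",
    "type": "ForEach",
    "dependsOn": [{"activity": "LookupTables", "dependencyConditions": ["Succeeded"]}],
    "typeProperties": {
        # Consume the output of the previous activity.
        "items": {"value": "@activity('LookupTables').output.value", "type": "Expression"},
        "isSequential": True,
        # Inner activities can refer to the current element with @item().
        "activities": [
            {"name": "PerTableStep", "type": "Wait", "typeProperties": {"waitTimeInSeconds": 1}}
        ],
    },
}

until = {
    "name": "UntilJobDone",
    "type": "Until",
    "typeProperties": {
        # Assumes the hypothetical endpoint returns a JSON body with a status field.
        "expression": {
            "value": "@equals(activity('CheckStatus').output.status, 'Done')",
            "type": "Expression",
        },
        "timeout": "0.01:00:00",  # d.hh:mm:ss, i.e. give up after one hour
        "activities": [
            {
                "name": "CheckStatus",
                "type": "WebActivity",
                "typeProperties": {"url": "https://example.com/status", "method": "GET"},
            }
        ],
    },
}

print(json.dumps([lookup, for_each, until], indent=2))
```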
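A wall-clock trigger might be defined along these lines. The trigger and pipeline names, the start time and the scheduledRunTime parameter are assumptions for this sketch.

```python
import json

schedule_trigger = {
    "name": "DailyTrigger",  # hypothetical trigger name
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {
            # Fire once a day, starting from the given UTC time.
            "recurrence": {
                "frequency": "Day",
                "interval": 1,
                "startTime": "2018-01-01T00:00:00Z",
                "timeZone": "UTC",
            }
        },
        "pipelines": [
            {
                "pipelineReference": {
                    "referenceName": "CopyPartitionedData",  # hypothetical pipeline
                    "type": "PipelineReference",
                },
                # Hand the scheduled time to a pipeline parameter.
                "parameters": {"scheduledRunTime": "@trigger().scheduledTime"},
            }
        ],
    },
}

print(json.dumps(schedule_trigger, indent=2))
```

An on-demand run needs no trigger definition; you start the pipeline manually and supply the parameter values yourself.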
Invoking one pipeline from another pipeline – In V2, you can now invoke a pipeline from another pipeline using the Execute Pipeline Activity.
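As an illustration, an Execute Pipeline activity in a parent pipeline could look like this. The pipeline names and the parameter being passed down are hypothetical.

```python
import json

execute_child = {
    "name": "RunChildPipeline",
    "type": "ExecutePipeline",
    "typeProperties": {
        "pipeline": {
            "referenceName": "ChildPipeline",  # hypothetical child pipeline
            "type": "PipelineReference",
        },
        # Forward a parent parameter to the child.
        "parameters": {"sourceFolder": "@pipeline().parameters.sourceFolder"},
        # Block the parent until the child run has finished.
        "waitOnCompletion": True,
    },
}

parent_pipeline = {
    "name": "ParentPipeline",
    "properties": {
        "parameters": {"sourceFolder": {"type": "String"}},
        "activities": [execute_child],
    },
}

print(json.dumps(parent_pipeline, indent=2))
```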
Delta Flows – Delta flows load only the data that has changed since the last iteration of the pipeline. New capabilities in V2, such as lookup activity, flexible scheduling, and control flow, enable this use case in a natural way.
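One common way to build a delta flow is a watermark pattern: look up the last loaded timestamp, then copy only the rows changed since then. The sketch below is one possible arrangement, not the only one; the table, column, dataset and activity names are assumptions, and the sink and dataset references of the Copy activity are omitted.

```python
import json

lookup_old_watermark = {
    "name": "LookupOldWatermark",
    "type": "Lookup",
    "typeProperties": {
        "source": {
            "type": "SqlSource",
            "sqlReaderQuery": "select WatermarkValue from watermarktable",
        },
        "dataset": {"referenceName": "WatermarkDataset", "type": "DatasetReference"},
    },
}

# Only rows modified after the stored watermark are selected.
incremental_query = (
    "select * from data_source_table "
    "where LastModifytime > "
    "'@{activity('LookupOldWatermark').output.firstRow.WatermarkValue}'"
)

copy_delta = {
    "name": "CopyChangedRows",
    "type": "Copy",
    "dependsOn": [
        {"activity": "LookupOldWatermark", "dependencyConditions": ["Succeeded"]}
    ],
    "typeProperties": {
        "source": {"type": "SqlSource", "sqlReaderQuery": incremental_query},
        "sink": {"type": "BlobSink"},
    },
}

print(json.dumps([lookup_old_watermark, copy_delta], indent=2))
```

A complete flow would also write the new watermark back after the copy succeeds so that the next run starts from there.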
Web Activity - Calls a custom REST endpoint from a Data Factory pipeline. You can pass datasets and linked services to be consumed and accessed by the activity.
Lookup Activity - Reads or looks up a record or table name value from any external source. This output can further be referenced by succeeding activities.
Get Metadata Activity - Retrieves the metadata of any data in Azure Data Factory.
Wait Activity - Pauses the pipeline for a specified period of time.