Microsoft have now released a v2 of Data Factory. Though this is still in preview, it has the handy ‘Author and Deploy’ tool, which includes the Copy Activity wizard to assist in creating a copy data pipeline. Most of this is the same as v1; however, there are changes that have been introduced in this second iteration. I have had the fortune of being able to work with these changes, and this blog is exactly about that. I will highlight the differences that Azure Data Factory v2 has brought in (as of the time of writing), so I wouldn’t be wrong in saying that further changes and differences are most likely on their way too. I am assuming here that anyone reading this blog has prior experience of using Data Factory. The following are the differences:

  1. Partitioning via a pipeline parameter – In v1, you could use the partitionedBy property and the SliceStart system variable to achieve partitioning. In v2, however, the way to achieve this behaviour is to perform the following actions (this applies both when using the Copy Wizard and when using an ARM template for the pipeline):
    1. Define a pipeline parameter of type string.
    2. Set folderPath in the dataset definition to the value of the pipeline parameter.
    3. Pass a hardcoded value for the parameter before running the pipeline, or pass a trigger start time or scheduled time dynamically at runtime.
    4. Here is an example of the above from an Azure Resource Manager template, followed by a sketch of how the parameter itself might be defined and passed:
      "typeProperties": {
          "format": {
              "type": "ParquetFormat"
          },
          "folderPath": {
              "value": "@concat('/test/', formatDateTime(adddays(pipeline().TriggerTime,0), 'yyyy'), '/', formatDateTime(adddays(pipeline().TriggerTime,0), 'MM'), '/', formatDateTime(adddays(pipeline().TriggerTime,0), 'dd'))",
              "type": "Expression"
          },
          "partitionedBy": [
              {
                  "name": "Year",
                  "value": {
                      "type": "DateTime",
                      "date": "SliceStart",
                      "format": "yyyy"
                  }
              },
              {
                  "name": "Month",
                  "value": {
                      "type": "DateTime",
                      "date": "SliceStart",
                      "format": "MM"
                  }
              },
              {
                  "name": "Day",
                  "value": {
                      "type": "DateTime",
                      "date": "SliceStart",
                      "format": "dd"
                  }
              }
          ]
      },
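    The template above derives the folder path directly from pipeline().TriggerTime. For the parameter-based approach described in steps 1 to 3, a minimal sketch of the surrounding pieces might look like the following; the parameter name windowStart, the pipeline name CopyPipeline and the trigger wiring are illustrative assumptions rather than part of the original template. The pipeline declares the parameter, the dataset folderPath references it, and the trigger (or a manual run) supplies the value:

      "name": "CopyPipeline",
      "properties": {
          "parameters": {
              "windowStart": { "type": "String" }
          }
      }

      "folderPath": {
          "value": "@concat('/test/', formatDateTime(pipeline().parameters.windowStart, 'yyyy/MM/dd'))",
          "type": "Expression"
      }

      "pipelines": [
          {
              "pipelineReference": { "referenceName": "CopyPipeline", "type": "PipelineReference" },
              "parameters": { "windowStart": "@trigger().scheduledTime" }
          }
      ]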
  2. Custom Activity – In v1, to define a custom activity you had to implement a (custom) DotNet Activity by creating a .NET Class Library project with a class that implements the Execute method of the IDotNetActivity interface. In Azure Data Factory v2, you are not required to implement a .NET interface for a Custom Activity. You can now directly run commands, scripts and your own custom code, compiled as an executable. To configure this implementation, you specify the command property together with the folderPath property; you upload the executable and its dependencies to folderPath and the Custom Activity executes the command for you. Linked services, datasets and extended properties defined in the JSON payload of a Data Factory v2 Custom Activity can be accessed by your executable as JSON files, and the required properties can be read using a JSON serialiser (a sketch of this is shown after the steps below). To create an executable for a Custom Activity you need to:
    1. Create a New Project in Visual Studio
    2. Choose Windows Desktop Application -> Console Application (.NET Framework). Be sure you target the .NET Framework and not .NET Core, otherwise a .exe will NOT be created at build time.
    3. Add code files as needed, including any JSON files (e.g. linked services) the application relies on.
    4. Once done, build the project; the executable will be output to \bin\<Debug or Release>\<MyProject>.exe in the project folder.
    5. Upload the .exe file to Blob Storage in Azure (make sure the executable is provided in the Azure Storage Linked Service template). When uploading a custom activity executable to blob storage, be sure to upload ALL contents of the bin\Debug (or Release) folder; just copy the entire folder to blob, otherwise the custom activity will fail because it will not be able to find the dependencies the application needs to run. Also, use subfolders when uploading custom activities; this makes it future proof in case further activities are added. Best practice here is to use Azure Storage Explorer, in which you can access the storage account and create the container and subsequent folders. This can’t be done directly in the Azure portal because blob storage is a flat structure, so the concept of folders is non-existent there; however, in Storage Explorer the ‘/’ creates a pseudo-hierarchy in the blob name, making it a virtual folder.
    6. Create the pipeline in Data Factory v2 using Batch Service -> Custom (a sketch of the resulting activity JSON is shown after these steps).
    7. Create a Batch account and pool (if not already created) and set up the pipeline as normal.
    8. Trigger the run and test the pipeline.
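For step 6, the activity definition in the pipeline JSON ends up looking roughly like the sketch below. The activity name, MyProject.exe, the folder path, the linked service names and the targetTable extended property are placeholders for illustration rather than required values:

      {
          "name": "MyCustomActivity",
          "type": "Custom",
          "linkedServiceName": {
              "referenceName": "AzureBatchLinkedService",
              "type": "LinkedServiceReference"
          },
          "typeProperties": {
              "command": "MyProject.exe",
              "folderPath": "customactivities/myproject",
              "resourceLinkedService": {
                  "referenceName": "AzureStorageLinkedService",
                  "type": "LinkedServiceReference"
              },
              "referenceObjects": {
                  "linkedServices": [],
                  "datasets": []
              },
              "extendedProperties": {
                  "targetTable": "dbo.MyTable"
              }
          }
      }

Here linkedServiceName points at the Azure Batch linked service that runs the command, while resourceLinkedService points at the storage account holding the uploaded folder.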
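Inside the console application itself, Data Factory v2 drops activity.json, linkedServices.json and datasets.json into the working directory of the Batch task, so the payload can be read back with a JSON serialiser. A minimal sketch, assuming Newtonsoft.Json is referenced and an extended property named targetTable has been defined as in the illustration above:

      using System;
      using System.IO;
      using Newtonsoft.Json.Linq;

      namespace MyProject
      {
          class Program
          {
              static void Main(string[] args)
              {
                  // Data Factory v2 places activity.json, linkedServices.json and
                  // datasets.json in the working directory of the Batch task.
                  JObject activity = JObject.Parse(File.ReadAllText("activity.json"));

                  // Read an extended property defined on the custom activity
                  // ("targetTable" is an illustrative name, not a required property).
                  string targetTable = (string)activity["typeProperties"]["extendedProperties"]["targetTable"];

                  // Anything written to the console surfaces in stdout.txt on the Batch node,
                  // which is where Batch Labs picks it up (see the note below).
                  Console.WriteLine("Running custom activity against " + targetTable);
              }
          }
      }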

Custom Activities run in Azure Batch, so make sure the Batch service meets the application’s needs. Whilst we are on the topic of Azure Batch, I would like to add a note on monitoring: to monitor custom activity runs in an Azure Batch pool, or an Azure Batch run in general, use the Batch Labs tool. Once the activity has run, you can view the stderr.txt or stdout.txt file for the run details.
