Azure Data Factory pipelines: Filling in the gaps

Azure Data Factory is a cloud-based data orchestration tool that many ETL developers began using instead of SSIS. In this article, Rodney Landrum recalls a Data Factory project where he had to depend on another service, Azure Logic Apps, to fill in for some missing functionality.

Though the term was not really in the vernacular then as it is today, I have been a full- or part-time “Data Engineer” my entire career. I have been quite comfortable with Microsoft ETL tools like SSIS for many years, dating back to the DTS days. My comfort with SSIS came from many years of trial and error, as well as from adhering to the best practices put forth and tested by many of my colleagues in the SQL Server field. It was, and still is, a widely used and well-documented ETL platform. With the release of Azure Data Factory several years ago, though it was not touted as an SSIS replacement, many data engineers started working with and documenting this code-free or low-code orchestration experience, and I was one of them.

As with any technology, only with knowledge and experience will you be able to take advantage of all its key benefits, and by the same token uncover its severe limitations. On a recent assignment to build a complex logical data workflow in Azure Data Factory, one that ironically had less “data” and more “flow” to engineer, I discovered not only the benefits and limitations of the tool itself but also the shortcomings of documentation that provided arcane and incomplete guidance at best. Some of the knowledge I lacked was intrinsic to Azure Logic Apps, which I grant I had done very little with until this project, but which played a pivotal role as an activity called from the pipeline. I want to share a few pieces of this project here in the hope of bolstering, however modestly, the available sources of quick insight into advanced challenges with ADF and, to a lesser extent, Azure Logic Apps.

Specifically, I was asked to create a pipeline-driven workflow that sends approval emails with a file attachment and waits for the recipients to either approve, reject or ignore the email. If the approvers do not respond to the emails within a time frame defined by several variables, such as the time of day and the type of file, then a reminder email must be sent. Again, the recipients can approve, reject or ignore the reminder. Finally, a third email is sent to yet another approver with the same options. Ultimately, the process either copies the approved file to a secure FTP site, once both of the initial two recipients or the final recipient approve the file, or it sends an email to the business saying the file was rejected. It may sound simple enough, even in a flow diagram; however, there were several head-scratchers and frustrated, lengthy ceiling stares that I might have easily avoided with a bit of foreknowledge.

The following are the four challenges I had to overcome to call the project a success:

When sending an approval email from Azure Logic Apps, which is initiated via a Webhook activity from the ADF pipeline, how do I force a response by a specific time of day? For example, if the initial emails must be approved or rejected by 9:00 AM, and the pipeline is triggered at 8:26 AM (itself a variable start time), how do I force the email step to return control to the pipeline in 34 minutes?
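At its core, this first challenge is clock arithmetic: work out how many minutes remain between the variable trigger time and the fixed deadline, then hand that duration to whatever wait or timeout setting is in play. Here is a minimal Python sketch of that calculation, assuming an arbitrary example date; the d.hh:mm:ss timeout string at the end is only my assumption of the shape such a duration takes when plugged into a pipeline setting.

```python
from datetime import datetime, timedelta

def minutes_until(deadline_hour: int, now: datetime) -> int:
    """Whole minutes from `now` until the next occurrence of the deadline hour."""
    deadline = now.replace(hour=deadline_hour, minute=0, second=0, microsecond=0)
    if deadline <= now:
        deadline += timedelta(days=1)  # already past today's deadline; roll to tomorrow
    return int((deadline - now).total_seconds() // 60)

# The example from the article: pipeline triggered at 8:26 AM with a 9:00 AM deadline.
trigger_time = datetime(2021, 6, 1, 8, 26)
wait_minutes = minutes_until(9, trigger_time)        # 34
hours, minutes = divmod(wait_minutes, 60)
timeout_string = f"0.{hours:02d}:{minutes:02d}:00"   # "0.00:34:00" (assumed d.hh:mm:ss format)
print(wait_minutes, timeout_string)
```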

The second challenge came with the Webhook activity itself. The Logic App needed to return status values back to the calling pipeline. While there was some minimal documentation explaining that an HTTP POST to a callbackURI was needed from within the Logic App, what I found lacking was an explanation of how to actually pass values back.
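The general pattern, as I understand it, is that ADF includes the callback URI in the request body it sends to the webhook endpoint, and the JSON body the Logic App later posts back to that URI is what the pipeline can read as the Webhook activity's output. In the real solution this POST is an HTTP action inside the Logic App; the Python sketch below only illustrates the shape of the round trip, and the property names (ApprovalStatus, ApproverComment) are hypothetical rather than a documented contract.

```python
import requests

def complete_approval(callback_uri: str, status: str, comment: str) -> None:
    """Post the approval outcome back to the waiting Webhook activity.

    The JSON body posted here is what the pipeline reads later as the
    Webhook activity's output; the property names are illustrative only.
    """
    body = {
        "ApprovalStatus": status,      # e.g. "Approved" or "Rejected"
        "ApproverComment": comment,
    }
    response = requests.post(callback_uri, json=body, timeout=30)
    response.raise_for_status()

# The callback URI is unique per pipeline run and arrives in the request body
# that the Webhook activity sends when it invokes the Logic App, e.g.:
# complete_approval(call_back_uri_from_request, "Approved", "File looks good")
```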

The third challenge was processing a rejection. The logic stated that if either of the initial approvers rejected the file, the pipeline needed to stop further processing immediately and notify the business so that a secondary file could be created and run through the workflow again. If the two initial approval emails were set to time out after 34 minutes with no response (following the example above) and one of the approvers rejected the file after 3 minutes, the pipeline could not dilly-dally for another 31 minutes, spinning cycles while waiting for the other approver.

Finally, each step in the process needed to be written to a logging table in an Azure SQL Database. That was not too difficult, as it was a simple matter of passing dynamic values to a parameterized stored procedure. However, the number of times this needed to happen brought a very unexpected consequence to my attention.
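To make the logging idea concrete, here is a minimal Python/pyodbc sketch of the kind of parameterized call involved. In the pipeline itself this is done through a Stored Procedure activity rather than client code, and the procedure name, parameter names, and connection details below are all invented for illustration.

```python
import pyodbc

# Placeholder connection string; the real pipeline uses a linked service,
# not client code, to reach the Azure SQL Database.
conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=tcp:myserver.database.windows.net,1433;"
    "Database=WorkflowLogs;Uid=loguser;Pwd=<password>;Encrypt=yes;"
)

def log_step(run_id: str, step_name: str, status: str, detail: str) -> None:
    """Record one workflow step by passing dynamic values to a parameterized proc.

    dbo.LogPipelineStep and its parameters are hypothetical names.
    """
    cursor = conn.cursor()
    cursor.execute(
        "EXEC dbo.LogPipelineStep @RunId = ?, @StepName = ?, @Status = ?, @Detail = ?",
        run_id, step_name, status, detail,
    )
    conn.commit()
    cursor.close()

# log_step(pipeline_run_id, "InitialApprovalEmailSent", "Waiting", "Sent at 08:26 AM")
```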
