Azure Data Factory: The Comprehensive Guide

In this blog, we are going to look at what Azure Data Factory (ADF) is and the different use cases it covers in Azure.

If you’re looking to get started with Azure Data Factory, this post is for you! In this tutorial, we’ll show you how to create your first data factory in Azure. 

We’ll also cover some of the basics of using ADF, including creating and managing pipelines and datasets.

Azure Data Factory is a cloud-based data integration service that allows you to easily create and manage data pipelines.

What is Azure Data Factory?

Azure Data Factory is a service offered by Microsoft Azure that allows you to create data-driven workflows in the cloud. 

It is a cloud-based data integration service that enables you to orchestrate and automate data transformation and data movement operations between different data stores and data processing services in the cloud. 

Azure Data Factory supports a wide range of data stores and data processing services, including Azure Storage, SQL Database, Azure HDInsight, and Azure Machine Learning.

These workflows can be used to orchestrate the movement and transformation of data. 

Azure Data Factory can be used to orchestrate data movement and transformations between on-premises data stores and Azure data stores, or between different Azure data stores.
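To make this concrete, here is a minimal sketch of creating a data factory programmatically with the Python SDK (azure-mgmt-datafactory). The subscription ID, resource group, factory name, and region are placeholders you would replace with your own values, and authentication is assumed to go through DefaultAzureCredential from the azure-identity package.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import Factory

# Authenticate with whatever credential is available locally (CLI login, env vars, etc.)
credential = DefaultAzureCredential()
adf_client = DataFactoryManagementClient(credential, "<subscription-id>")

# Create (or update) an empty data factory in the chosen region.
# "my-resource-group" and "my-data-factory" are placeholder names.
factory = adf_client.factories.create_or_update(
    "my-resource-group",
    "my-data-factory",
    Factory(location="eastus"),
)
print(factory.provisioning_state)
```

The later snippets in this post reuse this adf_client object.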

Main Data Factory Components

There are six main components that we need to understand in order to build our projects: pipelines, activities, triggers, integration runtimes, datasets, and linked services.

Think of a toy factory: raw parts move along a conveyor belt and get assembled, step by step, into a finished toy. Data Factory works the same way, except that it's engineering data instead of toys.

1. Activities

Activities represent each one of the steps performed on the data, which could be a copy, move, transform, enrich, etc.

Activities define the actions to perform on the data, such as a copy or a transformation. Each activity may consume and/or produce datasets.

They are the individual steps performed to handle and change the data. There are three kinds of activities in Data Factory: data movement, data transformation, and control activities.

The first one is data movement, which, at this moment, means the copy activity on ADF. There’s no move activity on Data Factory, although it’s relatively simple to combine copy and delete operations to do that.

At the time of writing, the copy activity can work with 87 different data stores if you also count the ones in preview.

Next, we have data transformation activities, which apply changes to the data. If you decide to transform data natively in ADF, you add a mapping data flow activity.

Finally, control activities define the logic that your pipeline is going to follow, with options such as ForEach, Set Variable or Append Variable, Until, and Wait.

Control activities also let you invoke other pipelines or execute SSIS packages. But as with anything in Azure, all these activities and pipelines need to be executed by a compute resource.
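As a hedged sketch of what a data movement activity looks like in code, here is a copy activity defined with the same Python SDK. The dataset names RawBlobDataset and CuratedBlobDataset are hypothetical placeholders for datasets you would define separately (see the Dataset section below).

```python
from azure.mgmt.datafactory.models import (
    CopyActivity,
    DatasetReference,
    BlobSource,
    BlobSink,
)

# A copy activity reads from a source dataset and writes to a sink dataset.
# The referenced dataset names are placeholders for datasets defined elsewhere.
copy_activity = CopyActivity(
    name="CopyRawToCurated",
    inputs=[DatasetReference(type="DatasetReference", reference_name="RawBlobDataset")],
    outputs=[DatasetReference(type="DatasetReference", reference_name="CuratedBlobDataset")],
    source=BlobSource(),
    sink=BlobSink(),
)
```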

2. Pipeline

Pipelines are the logical grouping of all the activities needed to perform a unit of work; in the toy factory analogy, everything required to assemble the toy.

Pipelines are the actual work product you create in Data Factory, in the same way that the work you do in Excel lives in a spreadsheet.

These pipelines can be created graphically on the Data Factory site or programmatically through code.

Integration runtimes are the compute infrastructure of ADF, like the engine of the conveyor belt.

Triggers are the power switch: they represent when you want the pipeline to execute. Do you want this pipeline to run Monday through Friday at 8 a.m.? Just create the trigger for it.

Finally, datasets are the representations of the data that we’re working with. As in the example that I have just mentioned, they would be the files or the SQL tables themselves.
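Since pipelines can also be created programmatically, here is a rough sketch (again with azure-mgmt-datafactory, reusing the adf_client and the copy_activity object from the earlier snippets) of grouping activities into a pipeline and deploying it.

```python
from azure.mgmt.datafactory.models import PipelineResource

# A pipeline is just the logical grouping of one or more activities.
pipeline = PipelineResource(activities=[copy_activity])

# Deploy the pipeline into the factory created earlier (names are placeholders).
adf_client.pipelines.create_or_update(
    "my-resource-group",
    "my-data-factory",
    "CopyPipeline",
    pipeline,
)
```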

Related Article: What is Data Lakes in Azure? – Guide on Azure Data Lake

3. Integration Runtimes (IR)

Integration Runtimes are the compute infrastructure of Azure Data Factory, which is needed to execute your activities. Integration runtimes are, in great part, what you actually pay for on ADF.

It provides four main capabilities: execution of Data Flow activities, execution of data movement activities such as the copy activity, dispatch and monitoring of activities running on other compute environments such as HDInsight or Databricks, and SSIS package execution.

These tasks can be performed by one of the three types of runtimes on the Data Factory.

1. Azure Integration Runtime

The first one is the Azure Integration Runtime, which is used for data movement between publicly accessible endpoints.

These include cloud-based resources in AWS, GCP, or Azure itself, as well as Software as a Service (SaaS) resources such as Salesforce and SAP.

Mapping dataflows, which are the native way to transform data on Data Factory, are also executed here.

When you create a new ADF resource, an Azure IR called AutoResolveIntegrationRuntime is automatically created for you, and it's recommended that you use this IR whenever possible.

2. Self‑hosted IR

Next, we have the self-hosted integration runtime, which is used to connect to private or on-premises resources such as a SQL Server. The way this works is pretty interesting.

When you create and configure a self-hosted runtime, you're prompted to install software on a machine in your on-premises environment.

This software is responsible for communicating with Data Factory and sending the data as needed over a regular outbound HTTPS connection.
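For illustration, here is a sketch of registering a self-hosted integration runtime with the Python SDK; the runtime name is a placeholder, and the returned authentication key is what you would paste into the software installed on your on-premises machine.

```python
from azure.mgmt.datafactory.models import (
    IntegrationRuntimeResource,
    SelfHostedIntegrationRuntime,
)

# Create the self-hosted IR definition inside the factory (names are placeholders).
shir = IntegrationRuntimeResource(
    properties=SelfHostedIntegrationRuntime(
        description="Connects to on-premises SQL Server"
    )
)
adf_client.integration_runtimes.create_or_update(
    "my-resource-group", "my-data-factory", "OnPremSelfHostedIR", shir
)

# Retrieve the keys used to register the on-premises node with this runtime.
keys = adf_client.integration_runtimes.list_auth_keys(
    "my-resource-group", "my-data-factory", "OnPremSelfHostedIR"
)
print(keys.auth_key1)
```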

3. Azure‑SSIS IR

Finally, there's the Azure-SSIS IR, which enables you to run SSIS packages on Data Factory. This is useful if you have done a lot of work in SSIS in the past and want to migrate all of it to ADF.

When you create this runtime, you're actually provisioning one or more Azure virtual machines that will be used solely to run SSIS packages.

Related Article: What Is Azure Databricks?

4. Linked Services

Linked services are like connection strings: they tell Data Factory where to find the data.

They give you connection information for external resources, which can be of two types. The first type is data stores, such as SQL Server, Azure SQL, Cosmos DB, and many more; 87 in total if you count the ones in preview.

The other linked service type represents an external compute resource, so that ADF knows where to dispatch the request.

This could be a system such as Azure ML, HDInsight, Databricks, or a SQL database for stored procedures.

In the toy factory analogy, linked services are more like the tray of parts: they tell you where your data is. Is your data in files? A linked service will point you to the right Blob storage or data lake. Is your data in SQL tables? Another linked service will give you the connection information for the SQL Server.
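As a minimal sketch, here is how a Blob storage linked service might be created with the Python SDK; the connection string and names are placeholders, and in practice you would keep the secret in Azure Key Vault rather than in code.

```python
from azure.mgmt.datafactory.models import (
    LinkedServiceResource,
    AzureStorageLinkedService,
    SecureString,
)

# The connection string is a placeholder; store real secrets in Key Vault.
storage_connection = SecureString(
    value="DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>"
)

blob_linked_service = LinkedServiceResource(
    properties=AzureStorageLinkedService(connection_string=storage_connection)
)
adf_client.linked_services.create_or_update(
    "my-resource-group", "my-data-factory", "BlobStorageLinkedService", blob_linked_service
)
```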


5. Dataset

Datasets, on the other hand, are more concerned with the data structures inside a data store. For example, describing that the name column is a string and that age is a number.

They are the representation of the data that you're working with. Datasets depend on linked services for connection information, which is why you also need to create a linked service first.

For example, the linked service will point to the SQL database, whereas the dataset might point to a table, query, or stored procedure.

Likewise, the linked service might point to a network share or Blob storage, whereas the dataset might point to the files.

Ultimately, the dataset is what you’re going to use in your activities as data inputs and outputs.
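Continuing the sketch, a dataset pointing at a specific file in the Blob storage defined by the linked service above might look like this; the container, folder path, and file name are placeholders.

```python
from azure.mgmt.datafactory.models import (
    DatasetResource,
    AzureBlobDataset,
    LinkedServiceReference,
)

# The dataset points at a concrete file, reusing the linked service for connection info.
blob_dataset = DatasetResource(
    properties=AzureBlobDataset(
        linked_service_name=LinkedServiceReference(
            type="LinkedServiceReference", reference_name="BlobStorageLinkedService"
        ),
        folder_path="input-container/raw",
        file_name="customers.csv",
    )
)
adf_client.datasets.create_or_update(
    "my-resource-group", "my-data-factory", "RawBlobDataset", blob_dataset
)
```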

Related Article: How to Perform ETL with Azure Databricks?

6. Triggers

Triggers are the Data Factory component that initiates the execution of a pipeline. Basically, they determine when a pipeline needs to run.

There are three types of triggers: schedule, tumbling window, and event‑based.

1. Schedule Triggers

Schedule triggers are ideal when you're running pipelines on a periodic basis; for example, once a day or every Sunday. The best way to remember schedule triggers is with the words ON and AT.

This trigger will run on Saturday at midnight, every day at 2:00AM, on the last day of the month, and so on.

2. Tumbling Window Triggers

Tumbling window triggers are best for processing data in fixed-size, contiguous time intervals, such as every 2 hours. For this trigger, the keyword to remember is EVERY: every 10 minutes, every hour, and so on.

3. Event‑based Triggers

Finally, event‑based triggers are what the name says.

They're fired based on an event. At the time of writing, though, they can only fire upon the creation or deletion of files in a Blob storage container.

It is possible that future versions of ADF will have more events, but that in itself is already an improvement in relation to SSIS, which didn’t have event‑based triggers. 
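To close the loop, here is a hedged sketch of a schedule trigger that runs the hypothetical CopyPipeline from the earlier snippets every day at 2:00 AM UTC; note that in recent SDK versions a trigger also has to be started before it begins firing.

```python
from azure.mgmt.datafactory.models import (
    TriggerResource,
    ScheduleTrigger,
    ScheduleTriggerRecurrence,
    TriggerPipelineReference,
    PipelineReference,
)

# Run every day at 02:00 UTC, starting from the given date (placeholder values).
recurrence = ScheduleTriggerRecurrence(
    frequency="Day",
    interval=1,
    start_time="2024-01-01T02:00:00Z",
    time_zone="UTC",
)

daily_trigger = TriggerResource(
    properties=ScheduleTrigger(
        recurrence=recurrence,
        pipelines=[
            TriggerPipelineReference(
                pipeline_reference=PipelineReference(
                    type="PipelineReference", reference_name="CopyPipeline"
                )
            )
        ],
    )
)
adf_client.triggers.create_or_update(
    "my-resource-group", "my-data-factory", "DailyTrigger", daily_trigger
)

# The trigger must be started before it fires (begin_start in recent SDK versions).
adf_client.triggers.begin_start(
    "my-resource-group", "my-data-factory", "DailyTrigger"
).result()
```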

Related Article: Azure Synapse: The Future of Data Management

What are the Benefits of using Azure Data Factory?

Azure Data Factory offers a rich set of features and benefits that can help you solve various data integration problems. Some of the key benefits include:

1. Flexibility: 

Azure Data Factory offers a wide range of connectors and data transformation capabilities, so you can easily integrate data from a variety of sources.

2. Scalability: 

Azure Data Factory can handle large volumes of data, and can easily scale to meet your needs.

3. Efficiency: 

Azure Data Factory optimizes data processing and minimizes the time required to complete tasks.

4. Cost-effectiveness:

Azure Data Factory is pay-per-use: you pay for pipeline runs and integration runtime hours rather than for idle infrastructure, which makes it a cost-effective way to integrate data.

How to Use ADF in your Business?

In order to use ADF in your business, you will need to understand the basics of how it works. ADF lets you quickly build pipelines that ingest data from your operational systems, transform it, and land it where your analysts and applications need it.

Typical uses include migrating on-premises databases to Azure, loading a data warehouse or data lake on a schedule, and lifting existing SSIS workloads into the cloud.

Because pipelines, triggers, and integration runtimes are managed services, your team can focus on the business logic of the data flows rather than on the infrastructure that runs them.

Conclusion

In this blog you have learned what Azure Data Factory is, how its main components (pipelines, activities, integration runtimes, linked services, datasets, and triggers) fit together, and the benefits it brings to data integration in Azure.

Azure Data Factory is a great way to get started with building cloud-based data pipelines, and here we showed you how to take the first steps. We hope you find it useful!

Related Article: Which Service Provides Serverless Computing in Azure?