In this blog, we are going to look at what Azure Data Factory is and the different use cases of Azure Data Factory (ADF) in Azure.
If you’re looking to get started with Azure Data Factory, this post is for you! In this tutorial, we’ll show you how to create your first data factory in Azure.
We’ll also cover some of the basics of using ADF, including creating and managing pipelines and datasets.
What is Azure Data Factory?
It is a cloud-based data integration service that enables you to orchestrate and automate data transformation and data movement operations between different data stores and data processing services in the cloud.
Azure Data Factory supports a wide range of data stores and data processing services, including Azure Storage, SQL Database, Azure HDInsight, and Azure Machine Learning.
Azure Data Factory can be used to orchestrate data movement and transformations between on-premises data stores and Azure data stores, or between different Azure data stores.
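For reference, here is a minimal sketch of creating a Data Factory instance programmatically. It assumes the azure-mgmt-datafactory and azure-identity Python packages; the subscription ID, resource group, and factory name are placeholders, not values from this article.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import Factory

# Placeholder identifiers; replace with your own values.
subscription_id = "<your-subscription-id>"
resource_group = "my-resource-group"   # assumed to already exist
factory_name = "my-data-factory"

# Authenticate with whatever credential is available (Azure CLI, managed identity, etc.).
credential = DefaultAzureCredential()
adf_client = DataFactoryManagementClient(credential, subscription_id)

# Create (or update) the Data Factory resource in a given region.
factory = adf_client.factories.create_or_update(
    resource_group, factory_name, Factory(location="eastus")
)
print(factory.name, factory.provisioning_state)
```

The same resource can also be created graphically in the Azure portal; the SDK route is just one option.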
Main Data Factory Components
There are six main components that we need to understand in order to build our projects: pipelines, activities, triggers, integration runtimes, datasets, and linked services.
Think of a physical factory with a conveyor belt moving goods along: Data Factory works the same way, except that it is engineering data instead.
1. Activities
Activities define the actions to perform on the data, such as copying or transforming it. Each activity might either consume or produce a dataset.
They are the individual steps performed to handle and change the data. There are three kinds of activities in Data Factory.
The first one is data movement, which, at this moment, means the copy activity on ADF. There’s no move activity on Data Factory, although it’s relatively simple to combine copy and delete operations to do that.
The copy activity can work with 87 different data stores at the time of writing this article, if you also consider the ones in preview.
Next, we have data transformation activities, which perform changes on the data. If you decide to use ADF itself for transformation, you add the mapping data flow activity.
Finally, data control activities define the logic that your pipeline is going to follow with options such as For Each, Set or Append variable, Until, and Wait.
Control activities also allow you to invoke other pipelines or SSIS packages. But as with anything in Azure, all these activities and pipelines need to be executed by a compute resource.
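As a rough illustration of the copy-plus-delete "move" pattern mentioned above, here is a hedged sketch using the azure-mgmt-datafactory Python SDK. The dataset names SourceFiles and DestinationFiles are hypothetical and assumed to exist already.

```python
from azure.mgmt.datafactory.models import (
    CopyActivity, DeleteActivity, DatasetReference,
    BlobSource, BlobSink, ActivityDependency,
)

source_ref = DatasetReference(type="DatasetReference", reference_name="SourceFiles")       # hypothetical dataset
sink_ref = DatasetReference(type="DatasetReference", reference_name="DestinationFiles")    # hypothetical dataset

# Data movement: copy blobs from the source dataset to the sink dataset.
copy_step = CopyActivity(
    name="CopyFiles",
    inputs=[source_ref],
    outputs=[sink_ref],
    source=BlobSource(),
    sink=BlobSink(),
)

# Cleanup: delete the source files only if the copy succeeded.
# Together, the two steps behave like a "move" activity.
delete_step = DeleteActivity(
    name="DeleteSourceFiles",
    dataset=source_ref,
    depends_on=[ActivityDependency(activity="CopyFiles",
                                   dependency_conditions=["Succeeded"])],
)

activities = [copy_step, delete_step]  # later wrapped in a pipeline
```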
2. Pipeline
Pipelines are where the actual work happens in Data Factory, in the same way that the work you do in Excel lives in a spreadsheet.
These pipelines can be created graphically on the Data Factory site or programmatically through code.
Integration runtimes are the compute infrastructure of ADF, like the engine of the conveyor belt.
Triggers are the power switch; they represent when you want a pipeline to execute. Do you want this pipeline to run Monday through Friday at 8 a.m.? Just create the trigger for it.
Finally, datasets are the representations of the data that we’re working with. As in the example that I have just mentioned, they would be the files or the SQL tables themselves.
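To make the pipeline idea concrete, here is a hedged sketch, using the same assumed SDK and placeholder names as above, that publishes a trivial pipeline and starts a run on demand.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import PipelineResource, WaitActivity

resource_group, factory_name = "my-resource-group", "my-data-factory"  # placeholders
adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<your-subscription-id>")

# A trivial pipeline with a single Wait activity, just to show the shape.
pipeline = PipelineResource(activities=[WaitActivity(name="WaitTenSeconds",
                                                     wait_time_in_seconds=10)])
adf_client.pipelines.create_or_update(resource_group, factory_name,
                                      "DemoPipeline", pipeline)

# Kick off a run and fetch its status by run id.
run = adf_client.pipelines.create_run(resource_group, factory_name, "DemoPipeline")
status = adf_client.pipeline_runs.get(resource_group, factory_name, run.run_id)
print(status.status)
```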
Related Article: What is Data Lakes in Azure? – Guide on Azure Data Lake
3. Integration Runtimes (IR)
An integration runtime provides four main capabilities: execution of Data Flow activities, execution of data movement activities such as the copy activity, dispatch and monitoring of activities sent to other compute environments such as HDInsight or Databricks, and SSIS package execution.
These tasks can be performed by one of the three types of runtimes on the Data Factory.
1. Azure Integration Runtime
The first one is the Azure Integration Runtime, which is used for data movement between publicly accessible endpoints.
These include cloud‑based resources such as AWS, GCP, or Azure itself, or Software as a Service (SaaS) resources such as Salesforce and SAP.
Mapping dataflows, which are the native way to transform data on Data Factory, are also executed here.
When you create a new ADF resource, an Azure IR called AutoResolveIntegrationRuntime is automatically created for you, and it's recommended that you use this IR whenever possible.
2. Self‑hosted IR
Next, we have the self‑hosted integration runtime, which is used to connect to private or on‑premises resources such as a SQL Server. The way this works is pretty interesting.
When you create and configure a self‑hosted runtime, you're prompted to deploy software on a machine in your on‑premises environment.
This software is responsible for communicating with Data Factory and sending the data as needed over a regular HTTP(S) connection.
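For illustration, registering a self-hosted IR through the assumed Python SDK might look like the following hedged sketch; the runtime name is a placeholder, and the retrieved authentication key is what the locally installed IR software uses to link your machine to this Data Factory.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    IntegrationRuntimeResource, SelfHostedIntegrationRuntime,
)

resource_group, factory_name = "my-resource-group", "my-data-factory"  # placeholders
adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<your-subscription-id>")

# Create the self-hosted IR definition in Data Factory; the actual compute is
# the machine in your own network where you install the IR software.
ir = IntegrationRuntimeResource(properties=SelfHostedIntegrationRuntime(
    description="Connects to on-premises SQL Server"))
adf_client.integration_runtimes.create_or_update(
    resource_group, factory_name, "OnPremSelfHostedIR", ir)

# The authentication key is entered into the locally installed IR software.
keys = adf_client.integration_runtimes.list_auth_keys(
    resource_group, factory_name, "OnPremSelfHostedIR")
print(keys.auth_key1)
```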
3. Azure‑SSIS IR
Finally, there's the Azure‑SSIS IR, which enables you to run SSIS packages on Data Factory. This is useful if you have done a lot of work in SSIS in the past and want to migrate all of it to ADF.
When you create this runtime, you’re actually creating a virtual machine with SQL Server, which will be used solely to run SSIS packages.
Related Article: What Is Azure Databricks?
4. Linked Services
They give you connection information to external resources, which can be of two types. The first type is data stores, such as SQL Server, Azure SQL, Cosmos DB, and many, many more; 87 in total if you consider the ones in preview.
The other linked service type represents an external compute resource so that ADF can know where to dispatch the request.
This could reside on systems such as Azure ML, HDInsight, Databricks, SQL databases for stored procedures, and so on.
Linked services are more like the tray: they tell you where your data is. Is your data in files? A linked service will point you to the right blob storage or data lake. Is your data in SQL tables? Another linked service will give you the connection information for the SQL Server.
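As a hedged sketch under the same assumptions as the earlier snippets, creating a Blob Storage linked service with the Python SDK could look roughly like this; the connection string and names are placeholders.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    LinkedServiceResource, AzureBlobStorageLinkedService,
)

resource_group, factory_name = "my-resource-group", "my-data-factory"  # placeholders
adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<your-subscription-id>")

# The linked service only stores where the data lives and how to connect,
# not what the data looks like; that is the dataset's job.
blob_ls = LinkedServiceResource(properties=AzureBlobStorageLinkedService(
    connection_string="DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>"))
adf_client.linked_services.create_or_update(
    resource_group, factory_name, "BlobStorageLinkedService", blob_ls)
```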
5. Dataset
Datasets are the representation of the data that you're working with. They depend on linked services for connection information, which is why you also need to create a linked service.
For example, the linked service will point to the SQL database, whereas the dataset might point to a table, query, or stored procedure.
Likewise, the linked service might point to a network share or Blob storage, whereas the dataset might point to the files.
Ultimately, the dataset is what you’re going to use in your activities as data inputs and outputs.
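Continuing the example, a dataset that points at specific files in that Blob Storage might be defined roughly as follows (hedged sketch; the container, folder, and file names are hypothetical).

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    DatasetResource, AzureBlobDataset, LinkedServiceReference,
)

resource_group, factory_name = "my-resource-group", "my-data-factory"  # placeholders
adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<your-subscription-id>")

# The dataset reuses the linked service for connection info and only adds
# the "what": which folder and file the activities should read or write.
blob_ds = DatasetResource(properties=AzureBlobDataset(
    linked_service_name=LinkedServiceReference(
        type="LinkedServiceReference", reference_name="BlobStorageLinkedService"),
    folder_path="input-container/raw",   # hypothetical container/folder
    file_name="orders.csv"))             # hypothetical file
adf_client.datasets.create_or_update(
    resource_group, factory_name, "InputOrdersDataset", blob_ds)
```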
Related Article: How to Perform ETL with Azure Databricks?
6. Triggers
There are three types of triggers: schedule, tumbling window, and event‑based.
1. Schedule Triggers
Schedule triggers are ideal for when you're running periodic pipelines; for example, once a day or every Sunday. The best way to memorize the schedule triggers is with the words ON and AT.
This trigger will run on Saturday at midnight, every day at 2:00AM, on the last day of the month, and so on.
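For example, a schedule trigger that runs a pipeline every day at 2:00 AM could be sketched as follows, assuming a current version of the azure-mgmt-datafactory SDK; the pipeline and trigger names are placeholders.

```python
from datetime import datetime
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    TriggerResource, ScheduleTrigger, ScheduleTriggerRecurrence,
    TriggerPipelineReference, PipelineReference, RecurrenceSchedule,
)

resource_group, factory_name = "my-resource-group", "my-data-factory"  # placeholders
adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<your-subscription-id>")

# Run "DemoPipeline" every day AT 02:00 UTC, starting from the given date.
recurrence = ScheduleTriggerRecurrence(
    frequency="Day",
    interval=1,
    start_time=datetime(2024, 1, 1),
    time_zone="UTC",
    schedule=RecurrenceSchedule(hours=[2], minutes=[0]))

trigger = TriggerResource(properties=ScheduleTrigger(
    recurrence=recurrence,
    pipelines=[TriggerPipelineReference(
        pipeline_reference=PipelineReference(
            type="PipelineReference", reference_name="DemoPipeline"))]))

adf_client.triggers.create_or_update(resource_group, factory_name,
                                     "DailyAt2AM", trigger)
adf_client.triggers.begin_start(resource_group, factory_name, "DailyAt2AM")
```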
2. Tumbling Window Triggers
Tumbling window triggers are best for periodic data processing over fixed, back-to-back intervals, such as every 2 hours. For this trigger, the keyword to remember is EVERY: every 10 minutes, every hour, and so on.
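A rough equivalent for a tumbling window trigger firing every 2 hours, under the same assumptions and placeholders as the schedule trigger sketch above, might look like this:

```python
from datetime import datetime
from azure.mgmt.datafactory.models import (
    TriggerResource, TumblingWindowTrigger,
    TriggerPipelineReference, PipelineReference,
)

# Each 2-hour window is processed exactly once, back to back with the next.
tumbling = TriggerResource(properties=TumblingWindowTrigger(
    pipeline=TriggerPipelineReference(
        pipeline_reference=PipelineReference(
            type="PipelineReference", reference_name="DemoPipeline")),
    frequency="Hour",
    interval=2,
    start_time=datetime(2024, 1, 1),
    max_concurrency=1))

# Published the same way as the schedule trigger, using the adf_client
# from the earlier snippets:
# adf_client.triggers.create_or_update(resource_group, factory_name,
#                                      "Every2Hours", tumbling)
```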
3. Event‑based Triggers
Finally, event‑based triggers are what the name says.
They're fired based on an event. At the time of this writing, though, they can only fire upon the creation or deletion of files in a Blob storage container.
It is possible that future versions of ADF will have more events, but that in itself is already an improvement in relation to SSIS, which didn’t have event‑based triggers.
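For completeness, an event-based trigger that fires when new blobs land in a container could be sketched roughly like this (hedged; the storage account resource ID, paths, and names are placeholders).

```python
from azure.mgmt.datafactory.models import (
    TriggerResource, BlobEventsTrigger,
    TriggerPipelineReference, PipelineReference,
)

# Fire whenever a .csv file is created under "input-container" in the given
# storage account; scope is the storage account's full resource ID.
blob_event_trigger = TriggerResource(properties=BlobEventsTrigger(
    events=["Microsoft.Storage.BlobCreated"],
    blob_path_begins_with="/input-container/blobs/",
    blob_path_ends_with=".csv",
    scope=("/subscriptions/<your-subscription-id>/resourceGroups/my-resource-group/"
           "providers/Microsoft.Storage/storageAccounts/<account>"),
    pipelines=[TriggerPipelineReference(
        pipeline_reference=PipelineReference(
            type="PipelineReference", reference_name="DemoPipeline"))]))

# Published the same way as the other triggers:
# adf_client.triggers.create_or_update(resource_group, factory_name,
#                                      "OnNewCsvFile", blob_event_trigger)
```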
Related Article: Azure Synapse: The Future of Data Management
What are the Benefits of using Azure Data Factory?
Azure Data Factory offers a rich set of features and benefits that can help you solve various data integration problems. Some of the key benefits include:
1. Flexibility:
Azure Data Factory offers a wide range of connectors and data transformation capabilities, so you can easily integrate data from a variety of sources.
2. Scalability:
Azure Data Factory can handle large volumes of data, and can easily scale to meet your needs.
3. Efficiency:
Azure Data Factory optimizes data processing and minimizes the time required to complete tasks.
4. Cost-effectiveness:
Azure Data Factory is a cost-effective way to integrate data and solve data problems.
How to Use ADF in your Business?
In order to use ADF in your business, you will need to understand the basics of how it works. ADF allows you to quickly and easily build pipelines that ingest data from your source systems, move it into Azure, and transform it for reporting and analytics.
This means that you can connect on-premises databases, SaaS applications, and cloud storage through linked services, describe the data with datasets, and schedule the whole process with triggers.
This gives you more control over how and when your data is processed and helps to ensure that the results meet your specific business needs.
Conclusion
In this blog you have learned what Azure Data Factory is, how to create your first data factory in Azure, and how its main components work: pipelines, activities, integration runtimes, linked services, datasets, and triggers.
ADF is a great way to get started with cloud-based data integration. Here we showed you how to get started with Azure Data Factory. We hope you find it useful!
Related Article: Which Service Provides Serverless Computing in Azure?