How to Perform ETL with Azure Databricks?

In this ETL tutorial, you will learn what ETL means and how to carry out the ETL process using Azure Databricks.

A common problem that organizations face is how to work with all of the different sources and formats of data they have coming in every single day.

Often where the data needs to reside for an application is not the same as where it’s collected from. And along with data being in different formats, it often needs refining and shaping before it is stored for use by consuming applications.

What is ETL?

This process of moving and refining data in the data pipeline is commonly known as extract, transform, and load or ETL.

And that’s what we’ll cover in this module: performing ETL (extract, transform, load) operations with Azure Databricks.

The basic purpose of ETL is to get your data ready for analytics, which you can do using data lake or data warehousing concepts within Azure Databricks.

This will be a hands-on walk-through: extracting data from Azure Data Lake Storage, using Azure Databricks to ingest and transform the data, and then finishing by loading the data into an Azure SQL Data Warehouse.

What is an Extract, Transform, and Load (ETL) Process?

Extract, transform, and load (ETL) is a data pipeline used to extract data from several sources, transform that data based on business rules, and load it into a destination data store.

This allows your intelligent applications to access the transformed data they need, in the format they need it.

ETL tasks can be difficult to visualize and map out. That’s because they typically require manipulation of a variety of data sets and almost anything is possible.

What might seem like a few simple ETL processes, combining data in routine, almost mechanical ways, can turn into a series of convoluted, extensive, and time-consuming projects.

What is Extraction?

Extraction process in ETL using Azure Databricks

Extraction pulls the raw data out of wherever it is stored. This can be Azure Blob Storage, Azure Data Lake Storage, or Hadoop, just to name a few different types of storage. Since you don’t want to change your raw data, and it’s often more data than the end-consumer applications need, you extract it from its stores for the transformation process.

The extraction itself is driven by notebook calls that request the data they need.

This allows the process to be scheduled or run interactively. And because Azure Databricks lives in Azure, you have easy access to all of your storage, natively and securely.
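To make the extraction step concrete, here is a minimal sketch in a Databricks notebook using PySpark. The container, storage account, and file path are hypothetical placeholders, and it assumes access to the storage account has already been configured (see the service principal sketch later in this article).

# Extract raw JSON listener data from Azure Data Lake Storage Gen2 into a DataFrame.
raw_df = (
    spark.read
        .option("multiline", "true")
        .json("abfss://<container>@<storageaccount>.dfs.core.windows.net/raw/listeners.json")
)
raw_df.printSchema()       # inspect the raw schema before transforming
display(raw_df.limit(10))  # preview a few records in the notebook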

What is Transformation?

Transformation process in ETL using Azure Databricks

Transformation in Azure Databricks involves processing the raw data into predictions and insights. Transformation activity executes on Azure Databricks Apache Spark clusters, driven by Azure Databricks notebooks and jobs.

Often, data is temporarily stored in tables for use during the ETL process. I like to think of this part as where you refine your data down to what you need for your intelligent applications and analytics dashboards.

The data transformation that takes place usually involves various operations, such as filtering and sorting, joining data together, cleaning, removing duplicates, and making sure that your data is valid.

And you can use your language of choice for this transformation process: Python, R, Scala, or SQL, whatever you need.
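As a rough illustration, here is what that refinement might look like in PySpark, continuing from the extraction sketch above; the column names are hypothetical.

from pyspark.sql import functions as F

# Filter, deduplicate, and reshape the raw data into the columns the consumers need.
clean_df = (
    raw_df
        .dropDuplicates(["listener_id"])                    # remove duplicate records
        .filter(F.col("country").isNotNull())               # drop rows that are not valid
        .withColumn("listen_date", F.to_date("timestamp"))  # shape the data for analytics
        .select("listener_id", "country", "genre", "listen_date")
)

# Temporarily store the result in a table for use later in the ETL process.
clean_df.createOrReplaceTempView("listeners_clean")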

What is Loading?

Loading process in ETL using Azure Databricks

Once the data is transformed, it needs to be placed into a database service for use by the consuming applications.

Options include Azure SQL Data Warehouse, Azure SQL Database, Cosmos DB, and more. You choose the database service that meets your needs, as many different options are available.

From here, consumers can use the data as they need it for those applications and dashboards like Power BI.

Because the data has been transformed based on the needs of the consuming applications, the databases will have all the data you need and little that you don’t.

And if you don’t get all of the data you need, you simply tweak your notebooks and jobs and rerun the process to extract, transform, and load what you do need.
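One way to perform the load step from a notebook is with the Azure Synapse (formerly SQL Data Warehouse) connector that ships with Databricks. This is only a sketch, continuing from the transformation example above; the JDBC URL, credentials, staging directory, and table name are placeholders.

# Write the transformed DataFrame into an Azure SQL Data Warehouse table.
(clean_df.write
    .format("com.databricks.spark.sqldw")
    .option("url", "jdbc:sqlserver://<server>.database.windows.net:1433;database=<dw-name>;user=<user>;password=<password>")
    .option("forwardSparkAzureStorageCredentials", "true")  # reuse the storage credentials already set in Spark
    .option("dbTable", "dbo.ListenerStats")
    .option("tempDir", "wasbs://<container>@<storageaccount>.blob.core.windows.net/tempdir")  # staging area for the load
    .mode("overwrite")
    .save())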

How does the ETL Process work?

Now let’s take a look at ETL in action. In our scenario, we’re dealing with radio listener information that’s been gathered into a JSON file. This data needs to be placed into an Azure Data Lake Storage account so it can be used in our ETL process.

Once the data is available, notebooks are created in Azure Databricks that interact with the data store to extract the listener information.

Notebooks will then perform actions to refine the data down to the desired state for use by the consuming applications.

Once the data is transformed, it will be placed into an Azure SQL Data Warehouse for final use by the consuming applications. In order to get this scenario up and running, there are a number of prerequisites that need to be in place for the ETL process.

Basically, what we need to have in place are the things you would already have in your environment when you bring in Azure Databricks: the data stores where your initial raw data is held and where it will be held once it goes through the transformation process, plus things like service principals.
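For example, granting Azure Databricks access to Azure Data Lake Storage Gen2 through a service principal typically means setting a handful of Spark configuration values. The sketch below assumes a secret scope named etl-scope, and the storage account, application (client) ID, and tenant ID are placeholders.

# Configure OAuth access to ADLS Gen2 with a service principal (client credentials).
account = "<storageaccount>.dfs.core.windows.net"
spark.conf.set(f"fs.azure.account.auth.type.{account}", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{account}",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{account}", "<application-id>")
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{account}",
               dbutils.secrets.get(scope="etl-scope", key="sp-secret"))  # keep secrets out of notebooks
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{account}",
               "https://login.microsoftonline.com/<tenant-id>/oauth2/token")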

How to Perform ETL using Azure Databricks?

In this module, you’ll learn the basics of ETL, and then we’ll take a look at how we use those concepts within Azure Databricks through the hands-on walk-through that follows.

ETL in Databricks Notebook

And that’s the ETL process in Azure Databricks. Now you’re ready to start transforming your data through the data pipeline for use by your intelligent applications.

Again, notebooks provide the basis for all the work here. Through code in the notebooks, you’re going to be able to complete all of the pieces from extracting the data, transforming and refining it, and loading it into its final storage place.

Once you’ve worked out the bugs, the notebooks can be run interactively or scheduled based on the needs of the consuming application.

And remember, if you don’t get the results you need in the end, simply modify your queries and rerun all the notebooks.
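If you split the work across several notebooks, one simple way to rerun the whole pipeline is to chain them from a driver notebook with dbutils.notebook.run. The notebook paths and the one-hour timeout below are hypothetical.

# Run the extract, transform, and load notebooks in order (timeout in seconds).
dbutils.notebook.run("/ETL/extract_listeners", 3600)
dbutils.notebook.run("/ETL/transform_listeners", 3600)
dbutils.notebook.run("/ETL/load_to_dw", 3600)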

Super easy to do, and it fits into the modern workflow of data scientists and business analysts. If you need further learning or information about the topics covered in this module, check out both sets of documentation on Azure Databricks at docs.microsoft.com and docs.azuredatabricks.net.

Both are great resources and will allow you to go deeper on any of the topics related to Azure Databricks.

ETL Process on Azure and Databricks Notebooks

Microsoft Azure Databricks is a cloud-hosted, Apache Spark-based platform that provides an interactive data science environment.

It combines the speed of local development with the power of the cloud, providing real-time collaboration, interactive notebooks, and interactive analytics dashboards.

Databricks notebooks enable users to execute code in Python, R, Scala, or SQL; connect to SQL databases such as Azure SQL Database or Hive on Azure HDInsight; or call external services such as Google Cloud Vision or AWS QLDB through their REST APIs.

1. Data Extraction and Importing in Databricks

As Azure Databricks is a collaborative and interactive environment, you can import and export data to and from DataFrames and Datasets by using the workspace user interface or REST APIs.

There are several formats and destinations you can export your data to, including CSV, JSON, SQL tables, Apache Hive, Amazon S3, and Azure Blob Storage.
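As a rough sketch, writing a DataFrame out to a few of these formats and stores looks like this; it reuses the clean_df DataFrame from the transformation example, and the output paths and table name are hypothetical.

# Export the transformed data in different formats and to different destinations.
clean_df.write.mode("overwrite").option("header", "true").csv("/mnt/export/listeners_csv")
clean_df.write.mode("overwrite").json("/mnt/export/listeners_json")
clean_df.write.mode("overwrite").saveAsTable("listeners_hive")  # managed table in the Hive metastore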

2. Loading Data from a Database and Excel file

If you need to upload data from an Excel file or a SQL database, you’ll need to give Databricks the connection information.

Use the file uploader in the Data Import section of the QuickStart page. If your data is stored in a SQL database, make sure the Microsoft ODBC Driver for SQL Server is installed, and then enter your login credentials for that database.

At the bottom of the window, specify a destination connection string that points to a flat-file location on a remote server or to a Microsoft Excel sheet on your computer, with path segments separated by backslashes.
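If you prefer code over the upload UI, a JDBC read from a SQL database in a notebook looks roughly like the sketch below; the server, database, table, and secret scope names are placeholders.

# Read a table from Azure SQL Database over JDBC into a DataFrame.
jdbc_url = "jdbc:sqlserver://<server>.database.windows.net:1433;database=<database>"
source_df = (
    spark.read
        .format("jdbc")
        .option("url", jdbc_url)
        .option("dbtable", "dbo.SourceTable")
        .option("user", dbutils.secrets.get(scope="etl-scope", key="sql-user"))
        .option("password", dbutils.secrets.get(scope="etl-scope", key="sql-password"))
        .load()
)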

3. Run SQL Queries for Transformation

Here, with the help of an example, we will run SQL queries on Excel datasheets and save the findings to a CSV file.

In this example, a colleague of mine has a list of data from which he wants to calculate the total size of potentially available storage on public-facing Azure disks, grouped by country.

To show you the steps, he took this list, ran SQL queries against the sheet, and saved the findings to a CSV file.

The SQL code for this would be:

SELECT `country`, SUM(`database_size`) as `total_space` FROM `publicstorage` GROUP BY `country`
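Running that query in a Databricks notebook and saving the result to CSV might look like the sketch below. It assumes the sheet has already been loaded into a temporary view named publicstorage, and the output path is a hypothetical placeholder.

# Aggregate available storage by country and write the result out as CSV.
result_df = spark.sql("""
    SELECT country, SUM(database_size) AS total_space
    FROM publicstorage
    GROUP BY country
""")
(result_df.write
    .mode("overwrite")
    .option("header", "true")
    .csv("/mnt/output/total_space_by_country"))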

Conclusion

In this post, the process of ETL has been explained with Azure Databricks as seen from a step-by-step logical point of view.

The processes have been structured with appropriate diagrams. Now it is time for you to experiment and generate your own insights from the data at your disposal using a free trial of Azure Databricks with a fully managed service experience.
