Nam Huynh Thien
12 min read · Jun 26, 2023

Since around the time of Covid-19, when I started to learn about data, I have been in love with data, its power, and its capabilities.

This is my blog where I am going to share my passion for unraveling the stories in the world of data engineering and analytics, where the possibilities are endless and the power of data knows no bounds.

I am a highly motivated individual driven by a thirst for knowledge and a desire to showcase my skills to the world. My goal is to dive deep into the realm of data engineering and understand how it can be applied in real life.

But why stop there? The allure of cloud technology has captivated me, as it offers the potential for faster, more accurate, and error-free data processing. With my previous experience as a BI/Data Analyst at HSV Group, I’ve had the privilege of working extensively with Microsoft’s suite of tools, including Power BI, Power Query, and Excel. This familiarity with their services makes Azure, Microsoft’s cloud platform, feel like a natural fit for expanding my skill set.

Now, let’s talk about Microsoft Fabric, a newly released product for data enthusiasts. Even though I’ve only explored its preview version, I can already see its immense power. This comprehensive service brings together several robust components, such as Data Factory, Synapse Data Engineering, Synapse Data Science, Synapse Data Warehouse, and Synapse Analytics. In the past these were separate services; now everything is consolidated into one, unleashing the full potential of data engineering.

Microsoft Fabric

It’s for these reasons that I firmly believe Microsoft Fabric is a highly competitive tool for data engineers, data scientists, and data analysts alike. But before we conquer this software-as-a-service (SaaS) marvel, let’s start by understanding its first component: Azure Data Factory.

I. SOLUTION ARCHITECTURE

We are going to build an ETL pipeline in Azure Data Factory using Covid-19 data.

This is the architecture we are going to follow throughout this project.

We have these components:

  1. Storage Solution: Azure Blob Storage, Azure Data Lake Storage Gen2, Azure SQL Database
  2. Ingestion Solution: Azure Data Factory
  3. Transformation Solution: Data Flow in Azure Data Factory, HDInsight, Azure Databricks
  4. Visualization Solution: Power BI

And everything will be in the same resource group (covid-reporting-rg) so they can interact with each other.
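If you prefer to script the setup instead of clicking through the portal, the resource group can also be created with the Azure SDK for Python. This is a minimal sketch, assuming the azure-identity and azure-mgmt-resource packages are installed; the subscription ID is a placeholder, while the group name and region are the ones used in this project.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

# Placeholder subscription ID; group name and region follow this project.
resource_client = ResourceManagementClient(DefaultAzureCredential(), "<subscription-id>")
resource_client.resource_groups.create_or_update(
    "covid-reporting-rg",
    {"location": "southeastasia"},  # Southeast Asia region
)
```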

II. SUMMARY ABOUT TECHSTACK IN ARCHITECTURE

I used ADF for data integration because it provides connectors to all of the data sources in this project, as well as a vast array of additional connectors that could be used to expand the project when new requirements arise. Data Factory can also orchestrate the whole workflow; for example, it can orchestrate transformations running in HDInsight and Azure Databricks.

I’ve used three transformation technologies here: Data Flow within Data Factory, HDInsight, and Azure Databricks. I’m using all three just to show the capabilities of Azure Data Factory; any one of them could meet all of our needs in this project.

All three of these technologies run on distributed infrastructure and are easily scalable. The difference is that Data Flow gives you a code-free transformation tool, which makes transformations easy to develop and maintain.

It is well suited to simple and medium-complexity transformations, but it falls short for complex ones. The other two options, HDInsight and Databricks, require you to write code in one of the Spark-supported languages such as Python, Scala, or Spark SQL. HDInsight also lets you write code in a SQL-like language called Hive and in a scripting language called Pig.

However, for this particular project, I won’t be exploring HDInsight and Azure Databricks. Instead, I will dedicate a separate project to how these technologies can be used effectively for complex transformations that require more coding skills.

III. DATA INGESTION

1. Environment Setup

Because we have four core components, we have to set them up before starting the project.

Storage solution: we have Azure Blob Storage and Azure Data Lake Storage Gen2.

a. Azure Blob Storage

  • Step 1: Navigate to the Azure portal and search for “Storage account”
  • Step 2: Click “Storage account” to set up Azure Blob Storage and click “Create”
  • Step 3: Choose your subscription and create a new resource group. In my case, the resource group is “covid-reporting-rg”. Under instance details, I already have an account named “namhuynhcovidrepotingsa”, with the region Southeast Asia.

Because the scope of this project does not require a wide range of data replication, I choose “Locally-redundant storage (LRS)”.

  • Step 4: For the Networking, Data protection, Encryption, Tags, and Review tabs, you can leave everything as default.

b. Azure Data Lake Storage Gen2

You will do the same to create Azure Data Lake Storage Gen2.

*NOTE: When you are in the “Advanced” tab, you have to tick “Enable hierarchical namespace” to activate Data Lake Storage Gen2. You can leave the rest as default.
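For reference, here is a hedged sketch of the same two storage accounts created with the azure-mgmt-storage SDK instead of the portal. The Data Lake account name is a placeholder of mine; the only difference between the two accounts is the hierarchical namespace flag mentioned in the note above.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import StorageAccountCreateParameters, Sku

storage_client = StorageManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Blob Storage account (flat namespace), locally-redundant as chosen above.
storage_client.storage_accounts.begin_create(
    "covid-reporting-rg",
    "namhuynhcovidrepotingsa",
    StorageAccountCreateParameters(
        sku=Sku(name="Standard_LRS"),
        kind="StorageV2",
        location="southeastasia",
    ),
).result()

# Data Lake Storage Gen2 account: identical, except for the hierarchical
# namespace flag, which corresponds to the "Advanced" tab tick box.
storage_client.storage_accounts.begin_create(
    "covid-reporting-rg",
    "<datalake-account-name>",  # placeholder name
    StorageAccountCreateParameters(
        sku=Sku(name="Standard_LRS"),
        kind="StorageV2",
        location="southeastasia",
        is_hns_enabled=True,
    ),
).result()
```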

c. Azure SQL Database

Step 1: Navigate to the search bar, find “SQL Database”, and then click “Create”

Step 2: Use the resource group created before, so the SQL database will be in the same resource group.

Then specify the name of the database; in my case it is “namhuynh-covid-reporting-db”, and the server is “namhuynh-covid-reporint-server”.

In case you don’t have a server, click new and create it.

You can leave the rest as default

*NOTE: This is a small project to show how we can use Azure Data Factory, so I will configure the compute and storage of the SQL database accordingly to optimize cost.

Click “Configure database”

Change the service tier from “General Purpose - Serverless” to “Basic (For less demanding workloads)”.

For backup storage redundancy, I choose “Locally-redundant backup storage”, which is enough for this project.

Now you can leave everything as default.
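The same configuration can be scripted with azure-mgmt-sql. This is a sketch under the assumption that the server already exists; the Basic SKU and locally-redundant backups mirror the portal choices above, and `requested_backup_storage_redundancy` is the parameter I believe maps to that redundancy setting.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.sql import SqlManagementClient
from azure.mgmt.sql.models import Database, Sku

sql_client = SqlManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Basic service tier with locally-redundant backups, to keep the cost minimal.
sql_client.databases.begin_create_or_update(
    "covid-reporting-rg",
    "namhuynh-covid-reporint-server",
    "namhuynh-covid-reporting-db",
    Database(
        location="southeastasia",
        sku=Sku(name="Basic", tier="Basic"),
        requested_backup_storage_redundancy="Local",
    ),
).result()
```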

d. Azure Data Factory

Step 1:

In the search bar, type “Data Factory”, select “Data Factory” from the search results, and click “Create”.

Step 2:

Then I configure my Azure Data Factory as follows:

  • Subscription: Choose my Azure subscription (Azure subscription 1)
  • Resource group: Select covid-reporting-rg
  • Name: “namhuynh-covidreporting-adf”
  • Region: Southeast Asia
  • Version: “V2”

You can configure your own Data Factory using different Resources, Name, Region and Version.

Leave the “Git configuration” tab as default, or skip it if you want to configure Git integration later.

In the “Networking” tab, leave the settings as default.

In the “Monitoring” tab, leave the settings as default.

Step 4: Review the summary of your Data Factory configuration in the “Review + create” tab. Double-check the settings to ensure they match your requirements.

Click the “Create” button to start the deployment of your Azure Data Factory. The deployment process may take a few minutes.

Once the deployment is complete, you will see a notification indicating that the Data Factory has been successfully created.
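The same factory can also be deployed with a few lines of Python, following the pattern from Microsoft’s Data Factory quickstart for the azure-mgmt-datafactory SDK; the subscription ID below is a placeholder.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import Factory

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# V2 data factory in the same resource group and region as everything else.
df = adf_client.factories.create_or_update(
    "covid-reporting-rg",
    "namhuynh-covidreporting-adf",
    Factory(location="southeastasia"),
)
print(df.name, df.provisioning_state)
```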

e. Dashboard

Finally, I pin all these resources to a dashboard so I can easily monitor and interact with them whenever I open the Azure portal.

The dashboard looks like this

2. Data Ingestion from Azure Blob into Azure Data Lake Gen2

As previously mentioned, our requirement is to ingest the population data set from Azure Blob Storage into our data lake. In this module, we will accomplish this using Azure Data Factory.

I have a population_by_age.tsv.gz file that will be stored in Azure Blob Storage. We want this data copied into Azure Data Lake Storage whenever the file arrives, and after that the .gz file should be automatically deleted from Blob Storage.

With that in mind, let’s start working in Azure Data Factory.

We have a .gz file stored in Azure Blob Storage and we need that data to flow into Azure Data Lake Storage Gen2.

In Azure Data Factory, we first prepare two linked services and two datasets (one each for Azure Blob Storage and Azure Data Lake Storage Gen2), and then we build a pipeline with a Copy activity to copy the data from Blob Storage into Data Lake Storage Gen2. A minimal sketch of these objects is shown below.
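To make the moving parts concrete, here is a hedged sketch of those objects defined through the Python SDK rather than the ADF authoring UI. The linked service, dataset, and folder names are my placeholders, the connection string and account key are stand-ins, and the pipeline is stripped down to the Copy activity only; the full pipeline in the next section adds the validation, metadata check, and delete steps.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    AzureBlobDataset, AzureBlobFSDataset, AzureBlobFSLinkedService, AzureBlobFSSink,
    AzureBlobStorageLinkedService, BlobSource, CopyActivity, DatasetReference,
    DatasetResource, LinkedServiceReference, LinkedServiceResource, PipelineResource)

rg, adf = "covid-reporting-rg", "namhuynh-covidreporting-adf"
client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Linked services: one for Blob Storage (source), one for Data Lake Gen2 (sink).
client.linked_services.create_or_update(rg, adf, "ls_ablb_covid_reporting",
    LinkedServiceResource(properties=AzureBlobStorageLinkedService(
        connection_string="<blob-connection-string>")))
client.linked_services.create_or_update(rg, adf, "ls_adls_covid_reporting",
    LinkedServiceResource(properties=AzureBlobFSLinkedService(
        url="https://<datalake-account>.dfs.core.windows.net", account_key="<key>")))

# Datasets: the .gz file in Blob Storage and its landing path in the data lake.
client.datasets.create_or_update(rg, adf, "ds_population_raw_gz",
    DatasetResource(properties=AzureBlobDataset(
        linked_service_name=LinkedServiceReference(
            type="LinkedServiceReference", reference_name="ls_ablb_covid_reporting"),
        folder_path="population", file_name="population_by_age.tsv.gz")))
client.datasets.create_or_update(rg, adf, "ds_population_raw_tsv",
    DatasetResource(properties=AzureBlobFSDataset(
        linked_service_name=LinkedServiceReference(
            type="LinkedServiceReference", reference_name="ls_adls_covid_reporting"),
        folder_path="raw/population", file_name="population_by_age.tsv")))

# Pipeline: a single Copy activity from Blob Storage into the data lake.
copy = CopyActivity(
    name="Copy Population Data",
    inputs=[DatasetReference(type="DatasetReference", reference_name="ds_population_raw_gz")],
    outputs=[DatasetReference(type="DatasetReference", reference_name="ds_population_raw_tsv")],
    source=BlobSource(),
    sink=AzureBlobFSSink())
client.pipelines.create_or_update(rg, adf, "pl_ingest_population_data",
    PipelineResource(activities=[copy]))
```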

2.1. Ingest from Blob Storage into Data Lake

This is my pipeline, called “pl_ingest_population_data”. First, it checks whether the .gz file exists, then it gets the file metadata, for which I choose column count and size. If the column count matches (13 columns in total), it runs the Copy activity and then deletes the source file from Azure Blob Storage. The control-flow skeleton is sketched below.
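Here is that skeleton expressed as the JSON you would see in the ADF authoring view, held in a Python dict for readability. The activity and dataset names are my placeholders, and the Copy and Delete bodies are elided.

```python
# Control-flow skeleton of pl_ingest_population_data, mirroring ADF pipeline JSON.
pl_ingest_population_data = {
    "name": "pl_ingest_population_data",
    "properties": {
        "activities": [
            {   # 1. Wait until the .gz file actually exists in Blob Storage.
                "name": "Check If File Exists", "type": "Validation",
                "typeProperties": {
                    "dataset": {"referenceName": "ds_population_raw_gz", "type": "DatasetReference"},
                    "timeout": "0.00:10:00", "sleep": 60}},
            {   # 2. Read the file's metadata: column count and size.
                "name": "Get File Metadata", "type": "GetMetadata",
                "dependsOn": [{"activity": "Check If File Exists",
                               "dependencyConditions": ["Succeeded"]}],
                "typeProperties": {
                    "dataset": {"referenceName": "ds_population_raw_gz", "type": "DatasetReference"},
                    "fieldList": ["columnCount", "size"]}},
            {   # 3. Copy (and then delete the source) only when all 13 columns are present.
                "name": "If Column Count Matches", "type": "IfCondition",
                "dependsOn": [{"activity": "Get File Metadata",
                               "dependencyConditions": ["Succeeded"]}],
                "typeProperties": {
                    "expression": {
                        "value": "@equals(activity('Get File Metadata').output.columnCount, 13)",
                        "type": "Expression"},
                    "ifTrueActivities": [
                        {"name": "Copy Population Data", "type": "Copy", "...": "..."},
                        {"name": "Delete Source File", "type": "Delete", "...": "..."}]}},
        ]
    },
}
```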

2.2. Ingest ECDC data into Data Lake

I call this pipeline “pl_ingest_ecdc_data”.

There are four datasets that need to be ingested into the data lake (cases_deaths.csv, hospital_admissions.csv, testing.csv, country_response.csv). Because I don’t want to ingest them manually, I put the sourceBaseURL, sourceRelativeURL, and sinkFile for each file into a JSON file. A Lookup activity reads this file, and a ForEach activity then runs one Copy activity per entry, as sketched below.
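The config file and the key expressions look roughly like this. The file layout follows the property names mentioned above, while the base URL, config file name, and activity names are placeholders of mine.

```python
# Shape of the JSON config that the Lookup activity reads; each entry drives
# one iteration of the ForEach loop. The base URL is a placeholder.
ecdc_file_list = [
    {"sourceBaseURL": "https://<ecdc-base-url>",
     "sourceRelativeURL": "cases_deaths.csv",
     "sinkFile": "cases_deaths.csv"},
    {"sourceBaseURL": "https://<ecdc-base-url>",
     "sourceRelativeURL": "hospital_admissions.csv",
     "sinkFile": "hospital_admissions.csv"},
    # ... testing.csv and country_response.csv follow the same shape.
]

# Key expressions wired into pl_ingest_ecdc_data (activity names are placeholders):
#   ForEach items:                     @activity('Lookup ECDC File List').output.value
#   HTTP source dataset relative URL:  @item().sourceRelativeURL
#   Sink dataset file name:            @item().sinkFile
```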

After clicking “Debug”, the data should be in Azure Data Lake Storage Gen2.

Then we continue to the transformation stage.

IV. DATA TRANSFORMATION

This project mainly focuses on using Azure Data Factory, so I won’t include HDInsight and Azure Databricks as part of it.

First, we need to look at the transformation requirements for the four datasets.

With cases_deaths.csv, we have 9 columns in total, but we only keep 6 of them. We drop continent, pivot indicator by daily_count into two new columns, case_count and death_count, and rename the date column.

We also need to look up the country_code_3_digit value from the country lookup file.

Using a Data Flow, the transformation looks like this:
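The Data Flow itself is built visually, but the same logic can be written down in pandas to make the steps explicit. This is an illustrative sketch and not the actual Data Flow: the file paths, the exact raw column names, and the indicator values ("confirmed cases", "deaths") are assumptions of mine.

```python
import pandas as pd

# Assumed paths and column names; the real files sit in the data lake's raw folder.
cases = pd.read_csv("raw/ecdc/cases_deaths.csv")
country_lookup = pd.read_csv("lookup/country_lookup.csv")  # includes country, country_code_3_digit

cases = cases.drop(columns=["continent"])  # drop the unwanted column

# Pivot the indicator column by daily_count into case_count / death_count,
# and rename the date column.
pivoted = (
    cases.pivot_table(index=["country", "country_code", "population", "date"],
                      columns="indicator",
                      values="daily_count",
                      aggfunc="sum")
         .reset_index()
         .rename(columns={"confirmed cases": "case_count",
                          "deaths": "death_count",
                          "date": "reported_date"})
)

# Look up country_code_3_digit from the country lookup file.
result = pivoted.merge(country_lookup[["country", "country_code_3_digit"]],
                       on="country", how="left")
result.to_csv("processed/ecdc/cases_deaths.csv", index=False)
```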

For the remaining datasets, we use the same method and steps according to the requirements of each dataset.

After using the Data Flow to perform the required transformations, we use a pipeline to write the data into our process folder in Azure Data Lake Storage Gen2.

Among the utilities in the Activities pane, we can easily find Data Flow. Drag it in, then set it up for the transformed data.

I named this pipeline “pl_process_cases_and_deaths_data”, using the Data Flow created earlier as its input.

Then click “Debug”. Now Azure Data Lake Storage Gen2 has the transformed cases_and_deaths data in the process folder.

I then do the same for the remaining datasets.

V. DATA WAREHOUSE

Now we move the transformed data from the process folder of Azure Data Lake Storage Gen2 into the SQL database prepared earlier.

The approach is the same, but now the source is Azure Data Lake Storage and the sink is the SQL database.

We have to prepare a dataset and a linked service for the SQL database sink.

We also have to create tables for these data. I have already created three tables: cases_and_deaths, hospital_admissions_daily, and testing. A sketch of one such table definition is shown below.
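As an example, the cases_and_deaths table could be created like this via pyodbc. The column names and types are my assumptions based on the transformed output described above, and the user and password are placeholders; they are not the exact schema used in the project.

```python
import pyodbc

# Placeholder credentials; server and database names are the ones created earlier.
conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=tcp:namhuynh-covid-reporint-server.database.windows.net,1433;"
    "Database=namhuynh-covid-reporting-db;"
    "Uid=<user>;Pwd=<password>;Encrypt=yes;"
)
cursor = conn.cursor()

# Assumed column layout matching the transformed cases_and_deaths output.
cursor.execute("""
CREATE TABLE dbo.cases_and_deaths (
    country              VARCHAR(100),
    country_code         VARCHAR(2),
    country_code_3_digit VARCHAR(3),
    population           BIGINT,
    case_count           BIGINT,
    death_count          BIGINT,
    reported_date        DATE
)
""")
conn.commit()
conn.close()
```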

Then we run a Copy activity for each table as follows.

When everything has been moved into the SQL database, we can check whether the data has landed there.

Once the data lives in the SQL database, it can serve analytics purposes.

VI. MAKING THE PIPELINE PRODUCTION READY

To begin with, our main objective is to achieve fully automated execution of pipelines, eliminating the need for manual triggers. We aim to run pipelines at regular intervals, triggered by events like new data arrivals. It is crucial for activities to execute only when their dependencies are fulfilled. For example, transformations should occur only after successful data ingestion. Additionally, we seek an easily monitorable solution for efficient issue resolution.

With these requirements in mind, let’s explore the options available within Data Factory to meet them. Data Factory allows us to chain activities, executing them based on the status of preceding activities, which we have already implemented. Furthermore, we can chain pipelines within a parent pipeline, triggering them based on the completion of previous pipelines, a feature we will use shortly.

Data Factory also supports creating dependencies between triggers, but currently only tumbling window triggers have this capability. Event-based or scheduled triggers cannot wait for a file arrival while depending on another trigger.

Lastly, there is an option to create custom solutions using SDKs like PowerShell, .NET, Python, or REST API to execute pipelines. However, this option should be considered as a last resort due to additional costs and potential complexities, as these solutions exist outside of Data Factory and may not be immediately evident to new team members.

Moving on to Data Factory itself, we currently have eight pipelines, seventeen datasets, and three data flows. To facilitate discussions, it is recommended to group pipelines into folders based on their purpose, such as ingestion, processing, and SQL-related pipelines. Similarly, datasets can be organized into folders according to their categories, such as raw, processed, and SQL datasets.

By grouping these objects, it becomes easier to refer to them.

To orchestrate these pipelines, we will create a parent pipeline and incorporate the ingestion and processing pipelines into it, and a tumbling window trigger will be used for the ECDC data pipeline. We first build the parent pipeline and its corresponding trigger for the population data pipeline, and then address the ECDC data pipeline. Lastly, remember to publish the changes made in Data Factory so they are saved. A stripped-down sketch of such a parent pipeline is shown below.
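A parent pipeline of this kind is essentially a chain of Execute Pipeline activities. Here is an illustrative sketch of its JSON, again held in a Python dict; the parent pipeline name, activity names, and the processing pipeline name are placeholders of mine.

```python
# Sketch of a parent pipeline that runs ingestion first and processing only
# after ingestion succeeds. Names here are illustrative placeholders.
pl_execute_population_pipelines = {
    "name": "pl_execute_population_pipelines",
    "properties": {
        "activities": [
            {
                "name": "Execute Ingest Population Data",
                "type": "ExecutePipeline",
                "typeProperties": {
                    "pipeline": {"referenceName": "pl_ingest_population_data",
                                 "type": "PipelineReference"},
                    "waitOnCompletion": True,
                },
            },
            {
                "name": "Execute Process Population Data",
                "type": "ExecutePipeline",
                # Runs only when the ingestion pipeline finishes successfully.
                "dependsOn": [{"activity": "Execute Ingest Population Data",
                               "dependencyConditions": ["Succeeded"]}],
                "typeProperties": {
                    "pipeline": {"referenceName": "pl_process_population_data",
                                 "type": "PipelineReference"},
                    "waitOnCompletion": True,
                },
            },
        ]
    },
}
```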

I created six new triggers and attached them to the Execute Pipeline parent pipelines; with that, the data pipelines are production ready. Creating one such trigger through the Python SDK is sketched below.
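For reference, creating and starting a tumbling window trigger through azure-mgmt-datafactory looks roughly like this. The trigger name, window size, start time, and target pipeline are assumptions on my part, not the exact six triggers used in the project.

```python
from datetime import datetime, timezone
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    PipelineReference, TriggerPipelineReference, TriggerResource, TumblingWindowTrigger)

rg, adf = "covid-reporting-rg", "namhuynh-covidreporting-adf"
adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# One 24-hour tumbling window per day, attached to the parent (execute) pipeline.
trigger = TumblingWindowTrigger(
    pipeline=TriggerPipelineReference(
        pipeline_reference=PipelineReference(
            type="PipelineReference",
            reference_name="pl_execute_population_pipelines"),
        parameters={},
    ),
    frequency="Hour",
    interval=24,
    start_time=datetime(2023, 6, 1, tzinfo=timezone.utc),
    max_concurrency=1,
)

adf_client.triggers.create_or_update(rg, adf, "tr_population_daily",
                                     TriggerResource(properties=trigger))
adf_client.triggers.begin_start(rg, adf, "tr_population_daily").result()
```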

In conclusion, Azure Data Factory is a great tool for data engineers and a great component of Microsoft Fabric. So if you want to learn about Microsoft Fabric, I think Azure Data Factory is a great place to start.

There is more to Azure Data Factory, such as CI/CD and monitoring, that I don’t cover in this blog. If you want to learn more, you can visit Ramesh Retnasamy’s course on Udemy; I will leave the link below for further learning.

Thanks for reading.

Ramesh Retnasamy’s Udemy course on Azure Data Factory.
