Building a Dimensional Data Warehouse Using dbt

Nam Huynh Thien
Aug 5, 2023 · 10 min read


Introduction

Today, I am thrilled to embark on a project centered on using dbt (data build tool) to create a data warehouse for a growing business.

The role of an Analytics Engineer has gained prominence recently, combining the skills of data analysts and data engineers. As an enthusiast of this field, I aim to expand my knowledge of analytics engineering technologies and techniques to enhance my abilities in serving business needs.

While I have some experience working with dbt, I admit that there are aspects of its technicalities, and of data warehouse design, that I have yet to grasp fully. This project is a perfect opportunity to delve into dbt and explore the toolkit of data warehouse solutions it offers.

Dataset: Northwind Traders

The Northwind Traders dataset, created by Microsoft for training and demonstration purposes, presents an array of sales data for a fictitious specialty foods export-import company. It encompasses sales transactions between Northwind Traders and its customers, as well as purchase transactions between Northwind and its suppliers.

Scenario

Modernizing Northwind Traders’ Infrastructure

Northwind Traders currently relies on a mix of on-premises and legacy systems, with MySQL as its primary operational database. The existing architecture supports recording sales transactions between the company and its customers, as well as generating reports and analytics. However, increasing reporting requirements are causing database slow-downs that affect day-to-day business operations.

Hence, the decision to modernize the infrastructure arises from the need for better scalability, reduced load on operational systems, improved reporting speed, and enhanced data security.

Proposed Solution: Migrating to Google Cloud Platform

To address these challenges, I propose migrating the existing solutions to Google Cloud Platform. We will move the on-premises MySQL database to Google’s managed Cloud SQL service, streamlining operations and the reporting solution. Additionally, we will build a dimensional data warehouse on Google Cloud Platform using BigQuery, making it the main OLAP solution to support all reporting requirements. The warehouse design will follow Kimball’s methodology, a widely used dimensional modeling technique.

Requirement Gathering: Defining Business Processes

Throughout the interview process with the business and stakeholders, we have identified several critical reporting requirements:

  • Sales Overview: Comprehensive sales reports to understand customer preferences, top-selling products, and underperforming items, providing a general overview of the business’s performance.
  • Sales Agent Tracking: Monitoring sales and performance of each sales agent to optimize commissions, recognize high achievers, and support underperforming agents.
  • Product Inventory: Analyzing current inventory levels, improving stock management, and negotiating better deals with suppliers based on purchase patterns.
  • Customer Reporting: Empowering customers with insights into their purchase history, enabling data-driven decisions and aligning their sales data with our reports. This report will also aid the customer care and marketing teams in understanding customer behavior for targeted campaigns.

Data Profiling: Understanding the System and Data

In this phase, we perform data profiling to gain insights into the system and data. Running queries helps us understand data redundancies, complexities, and table relationships. This step aids in forming the conceptual model and expediting the design solutions.

For example, the customer table contains 30 records.

Customer Table

A simple query counting the distinct values of the ID column quickly shows whether those records are unique.
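A minimal version of that check might look like the following sketch, assuming the table is named `customers` and the key column is `id` (adjust the names to match your copy of the Northwind schema):

```sql
-- Compare the total row count with the number of distinct customer IDs.
-- A mismatch means the table contains duplicate keys that the staging
-- layer will have to resolve.
SELECT
    COUNT(*)           AS total_rows,
    COUNT(DISTINCT id) AS distinct_ids
FROM customers;
```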

The query returns 29, so there appears to be a duplicate. This highlights an issue that the customer table transformation will need to handle in the staging layer.

After profiling the customer table, we repeat the same process for every table we expect to need when building the dimensional models.

Because there are many tables, the easiest approach I found is to return to the ERD and quickly identify the tables we are likely to need. At a glance, the customer table is clearly important: without customers we have nothing to analyze, and it relates directly to the orders table.

Business Matrix and Conceptual Modeling

Based on the information gathered so far, we document the high-level business processes identified earlier. Defining these processes at a conceptual level provides a strong starting point for modeling.

After that, we create a high-level conceptual model that outlines fact tables and dimension tables. This conceptual model will lay the foundation for subsequent modeling steps.


Architecture Design

The reference architecture for this analytic engineering project is designed to leverage the power of Google Cloud Platform (GCP) while ensuring an efficient and scalable data infrastructure.

  1. Data Sources and Migration: The project assumes that the data team has successfully migrated the on-premise Online Transaction Processing (OLTP) MySQL solution to Google Cloud SQL. This migration ensures that data from the original sources is now available on GCP for further processing.
  2. BigQuery Data Lake: The raw data from the migrated MySQL database is loaded into the BigQuery Data Lake layer. This data lake essentially replicates the MySQL OLTP data sources, creating a centralized repository for all raw data. By having a data lake, we avoid querying the data source directly, reducing the load on the transactional database system and improving overall performance.
  3. Staging Layer: From the Data Lake, the data is directed to the staging layer. Here, data transformation and cleansing take place. This crucial step ensures that the data is cleaned, standardized, and optimized for the subsequent modeling processes.
  4. Dimensional Data Warehouse Layer: After the staging layer, the cleaned and transformed data is used to build the dimension and fact models in the dimensional data warehouse layer. This step is vital for creating an efficient and organized data warehouse, facilitating effective analytics and reporting.
  5. One Big Table (OBT): To streamline reporting and connect data to Business Intelligence (BI) tools, I propose creating an OBT. The OBT acts as a consolidated table that combines all relevant data from different sources. This approach simplifies data retrieval, speeds up reporting processes, and showcases the versatility of various modeling styles within the data solution.

By following this reference architecture, the project aims to modernize Northwind Traders’ data and reporting solution effectively. Leveraging GCP’s capabilities and utilizing dbt for data transformation, the dimensional data warehouse will provide Northwind Traders with valuable insights and actionable business intelligence. Furthermore, the creation of an OBT will enable faster data access and empower the company’s decision-making processes.
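To make the flow from the data lake into the staging layer concrete, here is a minimal sketch of what a dbt staging model for the customers table could look like. The source name (`northwind`), the column names, and the deduplication rule are illustrative assumptions rather than the project’s exact code:

```sql
-- models/staging/stg_customers.sql (illustrative sketch)
-- Reads the raw customers table replicated into the data lake and drops
-- the duplicate key discovered during data profiling.
with source as (

    select * from {{ source('northwind', 'customers') }}

),

deduplicated as (

    select
        *,
        row_number() over (partition by id order by id) as row_num
    from source

)

select
    id as customer_id,
    company_name,   -- column names are assumptions; adapt to the real schema
    contact_name,
    city,
    country
from deduplicated
where row_num = 1
```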

Dimensional Modeling

The Dimensional Modeling phase is a critical step in the project, as it lays the foundation for organizing data in a way that supports efficient and effective analytics. This phase builds upon the conceptual model but includes additional details to refine the data structure.

During this phase, I engage in discussions with the business stakeholders to understand their priorities. Based on these discussions, we identify that “Sales Overview” and “Customer Reporting” hold the highest priority, as they provide valuable insights into customer behavior and overall sales performance. “Sales Agent Tracking” holds medium priority, as it helps track individual sales performance and adjust commissions accordingly. Lastly, “Product Inventory”, while essential, is assigned the lowest priority among the four processes.

With the architecture diagram in mind, outlining the flow from Google Cloud SQL to the data lake, staging layer, dimensional data warehouse layer, and finally the OBT layer, I now create a target mapping document. This document clarifies how data will be transformed and mapped at each stage, ensuring a smooth and organized data flow throughout the process.

With a thorough understanding of the source-to-target mapping and the priority of each business process, I am well-equipped to construct the logical model. The logical model will represent the high-level structure of the data warehouse, showing the relationships between the fact tables and the associated dimension tables. It will serve as a blueprint for the data warehouse design.

Logical Model

Building the logical model involves referencing the business metrics once again and starting with the fact table. The fact table represents the core business metrics or key performance indicators (KPIs) that will be used for analysis and reporting. For each fact table, all the associated dimensions are identified and connected. This process ensures that the data is organized in a way that facilitates meaningful analysis and insights.

Physical Model

With the logical model in place, the next step is to create the physical design model and develop the dimensional data warehouse. In this phase, the logical model is translated into a concrete implementation, considering the actual database schema and structures.

The source-to-target mapping document serves as a valuable reference during the creation of the physical model. It guides the setup of the data warehouse, ensuring that the data flows smoothly from the source systems to the staging layer, dimensional data warehouse layer, and finally to the OBT layer for BI tool integration.
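To give a feel for what the physical model looks like in practice, the sketch below shows how a customer dimension could be implemented as a dbt model (dbt itself is covered in the next section). The model name, the column list, and the use of the `dbt_utils.generate_surrogate_key` macro, which requires the dbt_utils package, are assumptions for illustration:

```sql
-- models/warehouse/dim_customers.sql (illustrative sketch)
-- Builds the customer dimension from the staged table and adds a
-- surrogate key that the fact tables can join on.
select
    {{ dbt_utils.generate_surrogate_key(['customer_id']) }} as customer_key,
    customer_id,
    company_name,
    contact_name,
    city,
    country
from {{ ref('stg_customers') }}
```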

dbt for Data Transformation

During the physical model creation, dbt takes center stage as the main data transformation tool. dbt is instrumental in handling the data transformation process, including data cleansing, aggregations, and calculations. Leveraging dbt streamlines the development of the data warehouse layer, making it easier to implement the logical model efficiently.

At this stage, all the necessary requirements for building the data warehouse have been addressed. The physical model is now in place, reflecting the logical model’s design and the insights gathered from the source-to-target mapping document.
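As an example of how such a transformation might look, here is a simplified sketch of a sales fact model. The grain, the staged models it references, and the revenue calculation are assumptions chosen for illustration, not the project’s exact implementation:

```sql
-- models/warehouse/fct_sales.sql (illustrative sketch)
-- Grain: one row per order line. Joins staged order details to their
-- parent order and attaches surrogate keys matching the dimensions.
with order_details as (

    select * from {{ ref('stg_order_details') }}

),

orders as (

    select * from {{ ref('stg_orders') }}

)

select
    {{ dbt_utils.generate_surrogate_key(['od.order_id', 'od.product_id']) }} as sales_key,
    {{ dbt_utils.generate_surrogate_key(['o.customer_id']) }}                as customer_key,
    {{ dbt_utils.generate_surrogate_key(['od.product_id']) }}                as product_key,
    o.order_date,
    od.quantity,
    od.unit_price,
    od.quantity * od.unit_price * (1 - od.discount)                          as net_revenue
from order_details od
join orders o
    on od.order_id = o.order_id
```

The surrogate keys must be generated the same way in the fact and dimension models so that the joins line up downstream.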

The dbt project is a robust combination of SQL files and YAML files, serving as the backbone of data transformation. At the core of the project is the “dbt_project.yml” file, which acts as the central configuration hub. It contains vital project details and essential information about the database and its settings.

I will walk through this by briefly introducing the folders in a dbt project.

Folder Structure:

Within the dbt project, various folders facilitate efficient data transformation:

  • Macros: These versatile pieces of reusable code resemble functions in programming languages. They empower us to keep SQL code DRY (Don’t Repeat Yourself). The inclusion of Jinja, a templating language within the “macros” folder, enhances the dynamism of our SQL code, eliminating static elements and promoting code flexibility.
  • Models: The heart of data transformation resides in the “models” folder. Here, SQL files come into play, defining how data will be transformed, cleansed, and aggregated. Leveraging dbt’s flexibility, we can meticulously structure and order our models, ensuring a streamlined and systematic approach to data transformation.
  • Snapshot: For managing Slowly Changing Dimensions (SCD), the “snapshot” folder proves invaluable. SCDs are dimensions with data that changes over time. dbt’s “snapshot” feature enables us to handle these changes efficiently, maintaining data integrity.
  • Tests: The “tests” folder embodies data validation through assertions. With dbt’s testing capabilities, we can validate assumptions about the data and SQL models. There are two testing approaches available: singular tests, which are SQL queries that return failing records, and generic tests, which cover repeated checks such as null-value or uniqueness validation. Illustrative sketches of a snapshot and a singular test follow after this list.
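First, a sketch of a snapshot that tracks changes to customer records as a Type 2 slowly changing dimension; the snapshot name, target schema, and checked columns are assumptions:

```sql
-- snapshots/customers_snapshot.sql (illustrative sketch)
-- Records a new row whenever one of the checked columns changes,
-- preserving the history of each customer (SCD Type 2).
{% snapshot customers_snapshot %}

{{
    config(
        target_schema='snapshots',
        unique_key='customer_id',
        strategy='check',
        check_cols=['company_name', 'contact_name', 'city', 'country']
    )
}}

select * from {{ ref('stg_customers') }}

{% endsnapshot %}
```

And second, a singular test, which is simply a SQL file that returns the rows violating an assumption; here, a hypothetical check that no order line has a non-positive quantity:

```sql
-- tests/assert_positive_quantity.sql (illustrative sketch)
-- The test fails if this query returns any rows.
select *
from {{ ref('stg_order_details') }}
where quantity <= 0
```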

At this point, I have built the staging layer, warehouse layer, and OBT reporting layer. After running the project, dbt materializes all of these models, and the three corresponding layers appear as datasets in Google BigQuery.
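For reference, the OBT in the reporting layer can be thought of as one wide model that joins the sales fact back to its dimensions. A simplified sketch, assuming model names such as `fct_sales`, `dim_customers`, and `dim_products` (the repository’s actual names may differ):

```sql
-- models/obt/obt_sales.sql (illustrative sketch)
-- One wide table combining the sales fact with its dimensions so that
-- BI tools can query a single object.
select
    f.order_date,
    c.company_name,
    c.country,
    p.product_name,
    f.quantity,
    f.unit_price,
    f.net_revenue
from {{ ref('fct_sales') }} f
left join {{ ref('dim_customers') }} c
    on f.customer_key = c.customer_key
left join {{ ref('dim_products') }} p
    on f.product_key = p.product_key
```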

GitLab Repository

To ensure seamless collaboration and version control, I have meticulously organized and pushed all code and resources related to this project on GitLab. The repository is available at the following link: https://gitlab.com/namhuynh.ftu/analytics-engineer-project

Conclusion

This dbt project represents a pivotal step in modernizing Northwind Traders’ data and reporting solution. By harnessing the power of dbt’s data transformation capabilities, we are well on our way to delivering actionable business intelligence and empowering data-driven decision-making for the enterprise. The dimensional data warehouse will provide an invaluable foundation for enhanced scalability, improved reporting speed, and heightened data security, setting the stage for a successful future in the realm of analytics and business intelligence.
